Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spans not nesting correctly when calling auto-instrumented functions from manually instrumented Kotlin coroutines #7124

Closed
mpeyper opened this issue Nov 10, 2022 · 15 comments · Fixed by #7341
Labels
bug Something isn't working

Comments

@mpeyper
Copy link

mpeyper commented Nov 10, 2022

Describe the bug
There are existing issues about @WithSpan not working with Kotlin Coroutines (#6502, #6542, #6776, #6887), but I think this issue is slightly different than those. Perhaps they are the same or similar issues, but I don't know enough to be sure.

Essentially, I've got an auto-instrumented span (io.opentelemetry.netty) calling a GRPC handler that calls a coroutine that has been manually instrumented with KotlinContextElement (io.opentelemetry.extension.kotlin) and within that coroutine it calls some regular functions that have been auto-instrumented with @WithSpan (io.opentelemetry.instrumentation.annotations).

When the spans are reported, the coroutine spans are correctly nested under the root netty span, however, the @WithSpan span is not nesting under the coroutine span, instead appearing as a sibling under the netty span.

If the code is change to remove coroutines and just use @WithSpan the the nesting behaves correctly. Similarly, if the coroutines call other manually instrumented coroutines, the nesting also works as expected

Steps to reproduce
This is a contrived example, but this class demonstrates the issue:

class PingHandler {
  fun ping(): String {
    return "pong"
  }

  @WithSpan
  fun slowPingSync() {
    Thread.sleep(10)
    ping()
  }

  suspend fun slowPingAsync() = withSpan {
    delay(10)
    ping()
  }

  @WithSpan
  fun pingSync() {
    Thread.sleep(20)
    slowPingSync()
  }

  suspend fun pingAsync() = withSpan {
    delay(20)
    slowPingAsync()
  }

  suspend fun pingMixedAsync() = withSpan {
    pingSync()
    pingAsync()
    slowPingSync()
    slowPingAsync()
  }

  @WithSpan
  fun pingMixedSync() {
    pingSync()
    traceBlocking {
      pingAsync()
    }
    slowPingSync()
    traceBlocking {
      slowPingAsync()
    }
  }
}

where withSpan and traceBlocking look like this:

suspend inline fun <Result> withSpan(
  name: String = getDefaultSpanName(),
  crossinline block: suspend (span: Span?) -> Result
): Result {
  val localTracer = GlobalOpenTelemetry.getTracer(object{}.javaClass.packageName)
  val span: Span = localTracer.spanBuilder(name).run {
    setParent(coroutineContext.getOpenTelemetryContext())
    coroutineContext[CoroutineName]?.let { setAttribute("coroutine.name", it.name) }
    startSpan()
  }

  return withContext(span.asContextElement()) {
    try {
      block(span)
    } catch (throwable: Throwable) {
      span.setStatus(StatusCode.ERROR)
      span.recordException(throwable)
      throw throwable
    } finally {
      span.end()
    }
  }
}

@Suppress("NOTHING_TO_INLINE") // inlining to remove this function from the stack trace
inline fun getDefaultSpanName(): String {
  val callingStackFrame = Thread.currentThread().stackTrace[1]

  val simpleClassName = Class.forName(callingStackFrame.className).simpleName
  val methodName = callingStackFrame.methodName

  return "$simpleClassName.$methodName"
}

inline fun <Result> traceBlocking(crossinline block: suspend (span: Span?) -> Result): Result {
  val span = Span.current()
  return runBlocking(span.asContextElement()) {
    block(span)
  }
}

and calling it like this:

class PingGrpc(private val pingHandler: PingHandler) : PingServiceGrpcKt.PingServiceCoroutineImplBase() {
  override suspend fun ping(request: PingRequest): PingResponse = coroutineScope {
    pingHandler.pingSync()
    pingHandler.pingAsync()
    pingHandler.pingMixedSync()
    pingHandler.pingMixedAsync()
    PingResponse.newBuilder().setMessage(pingHandler.ping()).build()
  }
}

Note: this was added to a private codebase using Armeria and GRPC for debugging, but if it's not helpful in this form, I can try to set up a smaller reproduction for you.

What did you expect to see?

Given the calls above I would expect to see traces nested something like this:

HTTP POST
  -> pingSync
    -> slowPingSync
  -> pingAsync
    -> slowPingAsync
  -> pingMixedSync
    -> pingSync
      -> slowPingSync
    -> pingAsync
      -> slowPingAsync
    -> slowPingSync
    -> slowPingAsync
  -> pingMixedAsync
    -> pingSync
      -> slowPingSync
    -> pingAsync
      -> slowPingAsync
    -> slowPingSync
    -> slowPingAsync

What did you see instead?

Any ...Sync span that should be nested below an ...Async span is instead a sibling instead of a child.

HTTP POST
  -> pingSync
    -> slowPingSync
  -> pingAsync
    -> slowPingAsync
  -> pingMixedSync
    -> pingSync
      -> slowPingSync
    -> pingAsync
      -> slowPingAsync
    -> slowPingSync
    -> slowPingAsync
  -> pingMixedAsync
    -> pingAsync
      -> slowPingAsync
    -> slowPingAsync 
  -> pingSync        <--------------- this should be nested under pingMixedAsync
    -> slowPingSync
  -> slowPingSync    <--------------- so should this

image

Note that only pingMixedAsync is different from what I expect here. The others work correctly as far as I can tell, I just left them in for completeness of what I have attempted.

What version are you using?

opentelemetry-javaagent: 1.19.2

Environment
Compiler: openjdk 17.0.2 2022-01-18
OS: MacOS Ventura 13.0

Additional context
N/A

@mpeyper mpeyper added the bug Something isn't working label Nov 10, 2022
@laurit
Copy link
Contributor

laurit commented Nov 10, 2022

Probably the issue is that pingSync that uses @WithSpan reads opentelemetry context from a ThreadLocal but withSpan that you use with coroutines stores opentelemetry context in coroutine context. To make pingSync appear under pingMixedAsync you should set the opentelemetry context that you have in coroutine context into the thread local. Maybe something like

coroutineContext.getOpenTelemetryContext().makeCurrent().use { 
  pingSync()
}

@mpeyper
Copy link
Author

mpeyper commented Nov 10, 2022

I may be naive, but I thought that the purpose of KotlinContextElement was to restore the thread local context when the coroutine resumes.

That said, using your suggestion does work to some degree. Changing pingMixedAsync to the following does down the correct trace:

  suspend fun pingMixedAsync() = withSpan {
    coroutineContext.getOpenTelemetryContext().makeCurrent().use {
      pingSync()
    }
    pingAsync()
    coroutineContext.getOpenTelemetryContext().makeCurrent().use {
      slowPingSync()
    }
    slowPingAsync()
  }

However, this is rather inconvenient to have to wrap every possible auto-instrumented call when called in the context of a coroutine. I tried to make it this instead with the hopes of moving it into withSpan and have it just work:

  suspend fun pingMixedAsync() = withSpan {
    coroutineContext.getOpenTelemetryContext().makeCurrent().use {
      pingSync()
      pingAsync()
      slowPingSync()
      slowPingAsync()
    }
  }

With the setup, pingSync, pingAsync and slowPingAsync are all nested correctly, but slowPingAsync is under the root span still. It seems any auto-instrumented span made after the first suspension gets the wrong parent again.

To make it a little more concrete, take something like this:

class MyHandler {
  private val client = OkHttpClient()

  private fun getFromApi(): String {
    val request = Request.Builder()
      .url("https://some.remote.resource.json")
      .build();

    return client.newCall(request).execute().use {
      it.body?.string() ?: ""
    }
  }

  @WithSpan
  fun getFromApiAuto(): String {
    // some work before fetching data
    return getFromApi()
  }

  suspend fun getFromApiSuspend() = withSpan {
    // some work that suspends before fetching data
    getFromApi()
  }
}

Calling getFromApiAuto works exactly as expected, but getFromApiSuspend doesn't and I'm not sure it can without having a deep understanding of everything and adding a bit on manual wrappers around specific calls. It just feels like tracing through Kotlin coroutines is just very hit and miss right now.

@andrewloux
Copy link

Same issue as #6502 (comment)

@laurit
Copy link
Contributor

laurit commented Nov 27, 2022

I may be naive, but I thought that the purpose of KotlinContextElement was to restore the thread local context when the coroutine resumes.

I think the problem is in kotlin coroutine instrumentation, disabling it with -Dotel.instrumentation.kotlinx-coroutines.enabled=false seems to give better results. When it is disabled then calling makeCurrent doesn't seem to be needed, the thread local context gets updated when opentelemetry context is updated in the coroutine context.

@mpeyper
Copy link
Author

mpeyper commented Nov 27, 2022

Just to clarify @laurit, are you saying that it should work with kotlin coroutine instrumentation enabled but it's not so there is a bug to fix there, or that we should not be using the kotlin coroutine instrumentation when trying to trace around coroutines? (I'm assuming the former, but I'm fairly new to this world or OpenTelemetry and struggling to read between the lines)

@laurit
Copy link
Contributor

laurit commented Nov 28, 2022

@mpeyper I believe that there is a bug.

@laurit
Copy link
Contributor

laurit commented Nov 28, 2022

When you disable coroutine instrumentation it is possible that you'll need to use something like withContext(Context.current().asContextElement()) {...} to get spans in coroutine under the parent span that is outside of the coroutine. Let me know whether you get better traces with the instrumentation disabled.

@mpeyper
Copy link
Author

mpeyper commented Nov 28, 2022

Yes, the traces are as I expect them with the instrumentation disabled:

  • With instrumentation
    image
  • Without instrumentation
    image

Note that I did not have to change any of the code shared above to get these traces, I just added -Dotel.instrumentation.kotlinx-coroutines.enabled=false to the start command.

I guess the obvious question is what are we missing by disabling the kotlin coroutine instrumentation? While these traces are now accurate, will there be others that aren't, or previously working cases that will now need manual instrumentation?

@trask
Copy link
Member

trask commented Nov 29, 2022

@mpeyper I believe that there is a bug.

@mpeyper can you create a minimal repro that we can use to troubleshoot the issue?

@laurit
Copy link
Contributor

laurit commented Nov 30, 2022

I guess the obvious question is what are we missing by disabling the kotlin coroutine instrumentation? While these traces are now accurate, will there be others that aren't, or previously working cases that will now need manual instrumentation?

When coroutine instrumentation is disabled you may need to manually propagate the parent span that is outside of the coroutine with withContext. Hopefully #7341 fixes this so you won't need to disable the instrumentation.

@mpeyper
Copy link
Author

mpeyper commented Dec 6, 2022

@trask @laurit I finally got around to making the repro for you:

https://github.com/mpeyper/open-telemetry-coroutine-tracing-repro

Let me know if there is anything else I can do to help.

trask pushed a commit that referenced this issue Dec 6, 2022
Hopefully resolves
#7124
Our kotlin coroutine instrumentation relies on a shaded copy of
`opentelemetry-extension-kotlin`. This doesn't work well when
application also uses `opentelemetry-extension-kotlin`, because the
shaded and unshaded copy store opentelemery context under different key.
This pr attempts to fix this by instrumenting
`opentelemetry-extension-kotlin` provided by the application so that it
would delegate to the one shaded inside the agent.

Co-authored-by: Mateusz Rzeszutek <mrzeszutek@splunk.com>
@trask
Copy link
Member

trask commented Dec 6, 2022

@mpeyper can you try with the latest SNAPSHOT build in a couple of hours once the new SNAPSHOT build with #7341 has been published?

@mpeyper
Copy link
Author

mpeyper commented Dec 6, 2022

@trask I ran by repro test with the opentelemetry-javaagent-1.21.0-20221206.221310-88.jar build and the traces are produced correctly 🎉

Results can be seen here.

@trask
Copy link
Member

trask commented Dec 7, 2022

thanks @mpeyper @laurit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants