
WebClient/Netty instrumentation produces broken traces on Kubernetes #2229

Closed
flangknecht opened this issue Feb 9, 2021 · 3 comments
Labels: bug (Something isn't working)

@flangknecht commented Feb 9, 2021

Describe the bug

Using Spring's WebClient, the first HTTP request made over a "fresh" Netty channel fails to propagate a coherent parent trace ID when the application runs on Kubernetes, so the resulting trace is broken.

Netty also seems to create a new channel roughly every 100 requests (at which point the OTel agent has to re-instrument some lambdas), and broken traces can then show up again.

With debug output enabled for the OTel agent, you can see the following log lines when the issue occurs.

[opentelemetry.auto.trace 2021-02-02 12:45:36:047 +0000] [http-nio-8080-exec-2] DEBUG io.opentelemetry.javaagent.instrumentation.api.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$MonoChannelPromise$$Lambda$1167/0x0000000800a14040@7af407fd
[opentelemetry.auto.trace 2021-02-02 12:45:36:051 +0000] [reactor-httpepoll-1] DEBUG io.opentelemetry.javaagent.instrumentation.api.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$$Lambda$1173/x0000000800a47040@17d34d93
[opentelemetry.auto.trace 2021-02-02 12:45:36:051 +0000] [reactor-httpepoll-1] DEBUG io.opentelemetry.javaagent.instrumentation.api.concurrent.RunnableWrapper - Wrapping runnable task reactor.netty.transport.TransportConnector$$Lambda$1173/0x0000000800a47040@17d34d93
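For reference, the RunnableWrapper in these log lines wraps submitted tasks so that the submitting thread's trace context can be restored when the task later runs on a Reactor Netty event-loop thread. The following is only an illustrative sketch of that wrap-and-restore pattern using the OpenTelemetry context API, not the agent's actual implementation:

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

// Illustrative only: the general pattern a runnable wrapper uses to carry the
// caller's trace context onto the thread that eventually executes the task.
final class ContextPropagatingRunnable implements Runnable {
  private final Runnable delegate;
  private final Context capturedContext;

  ContextPropagatingRunnable(Runnable delegate) {
    this.delegate = delegate;
    // Capture the context of the thread that submits the task (e.g. http-nio-8080-exec-2).
    this.capturedContext = Context.current();
  }

  @Override
  public void run() {
    // Restore that context on the executing thread (e.g. reactor-http-epoll-1).
    try (Scope ignored = capturedContext.makeCurrent()) {
      delegate.run();
    }
  }
}
```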

Replacing WebClient with RestTemplate works around the issue: the service then produces correct spans from the get-go.
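For illustration, these are the two styles of downstream call being compared; the target URL and method names below are placeholders, not the exact code in the sample project:

```java
import org.springframework.web.client.RestTemplate;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Hypothetical downstream calls; "service-two" and the root path are placeholders.
class DownstreamClients {

  // Reactive call via WebClient (backed by Reactor Netty channels) --
  // the variant that produced broken traces on Kubernetes.
  Mono<String> callWithWebClient() {
    return WebClient.create("http://service-two:8080")
        .get()
        .uri("/")
        .retrieve()
        .bodyToMono(String.class);
  }

  // Blocking call via RestTemplate -- the variant that traced correctly.
  String callWithRestTemplate() {
    return new RestTemplate()
        .getForObject("http://service-two:8080/", String.class);
  }
}
```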

After some debugging of the javaagent instrumentation, I found that, for affected requests, the parentContext read from a freshly instrumented channel is not null (as HttpClientRequestTracingHandler expects it to be) but an empty instance of ArrayBasedContext.
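As far as I can tell, the handler only falls back to the current (server) context when the channel attribute is null, so an empty-but-non-null context ends up being used as the parent and the client span loses its link to the server span. The snippet below is only a simplified, hypothetical sketch of that check; the attribute key and class names are placeholders, not the agent's actual code:

```java
import io.netty.channel.Channel;
import io.netty.util.AttributeKey;
import io.opentelemetry.context.Context;

// Simplified, hypothetical illustration of the failure mode described above.
final class ParentContextLookup {
  static final AttributeKey<Context> PARENT_CONTEXT =
      AttributeKey.valueOf("example.parent-context");

  static Context resolveParent(Channel channel) {
    Context parentContext = channel.attr(PARENT_CONTEXT).get();
    if (parentContext == null) {
      // Expected fallback: use the context of the active server span.
      return Context.current();
    }
    // On a freshly instrumented channel (the Kubernetes case), this is an empty
    // ArrayBasedContext rather than null, so the empty context "wins" and the
    // resulting client span is not linked to the server span.
    return parentContext;
  }
}
```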

| K3s | Minikube | docker-compose |
|---|---|---|
| empty, but non-null parent context | empty, but non-null parent context | properly populated parent context |
| K3s debug | Minikube debug | docker-compose debug |
| K3s Trace | Minikube Trace | docker-compose trace |

Steps to reproduce

I've provided a sample project, including Kubernetes manifests and a docker-compose descriptor, at https://github.com/flangknecht/opentelemetry-webclient-k8s

So far I have only been able to reproduce this issue when the applications were deployed to Kubernetes.
I've tried three different Kubernetes distributions as well as a docker-compose setup.

  1. Clone repo and cd into local copy
  2. Deploy to Kubernetes (kubectl apply -f k8s) and create appropriate port forwards to your workstation
    1. kubectl port-forward jaeger-... 16686
    2. kubectl port-forward opentelemetry-service-one-... 8080 1044
  3. Call web service 1 (HTTP GET localhost:8080/)
    • Check the Jaeger UI for traces from service 1 and see a broken trace
  4. Make another call to web service 1 within a short period of time (roughly less than 2 minutes)
    • Check the Jaeger UI again and see a new (correct) trace with 5 spans instead of the 3 spans seen before

What did you expect to see?

All requests hitting service 1 should produce traces with 5 spans.

What did you see instead?

  • Initial trace for service 1 with 3 spans
    • io.opentelemetry.javaagent.tomcat (service 1)
    • io.opentelemetry.javaagent.spring-webmvc (service 1)
    • io.opentelemetry.javaagent.spring-webflux (service 1)
  • Another trace
    • io.opentelemetry.javaagent.netty (service 1)
    • io.opentelemetry.javaagent.tomcat (service 2)
    • io.opentelemetry.javaagent.spring-webmvc (service 2)

Trace JSONs are available at https://github.com/flangknecht/opentelemetry-webclient-k8s/tree/master/traces

What version are you using?

v0.15.0

Environment

Compiler: AdoptOpenJDK (build 11.0.10+9, mixed mode)
OS:

  • compile: Mac OS 11.2
  • k3s:
    • kernel: Linux debtest 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 GNU/Linux
    • k8s server version: v1.20.0+k3s2 (Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"})
  • Minikube on MacOS:
    • kernel: Linux debtest 4.19.121-linuxkit #1 SMP Tue Dec 1 17:50:32 UTC 2020 x86_64 GNU/Linux
    • k8s server version: v1.20.2 (Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"})
  • custom Kubernetes installation:
    • kernel: Linux 5.4.87-flatcar #1 SMP Fri Jan 8 19:54:07 -00 2021 x86_64 x86_64 x86_64 GNU/Linux
    • k8s server version: v1.18.14 (Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"89182bdd065fbcaffefec691908a739d161efc03", GitTreeState:"clean", BuildDate:"2020-12-18T12:02:35Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"})
  • docker-compose (problem doesn't occur here):
    • kernel (WSL2): Linux 5.4.72-microsoft-standard-WSL2
    • kernel (MacOS): Linux 4.19.121-linuxkit

Runtime: AdoptOpenJDK OpenJDK 64-Bit Server VM 11.0.4+11

Additional context

Also happened with 0.14.0

flangknecht added the bug label on Feb 9, 2021
@trask (Member) commented Feb 9, 2021

Hi @flangknecht! I'm hopeful this may have been fixed very recently in #2156; would you be able to re-test with the latest SNAPSHOT? (browse all 0.16.0-SNAPSHOTS)

@flangknecht (Author) commented

> Hi @flangknecht! I'm hopeful this may have been fixed very recently in #2156; would you be able to re-test with the latest SNAPSHOT? (browse all 0.16.0-SNAPSHOTS)

Hi @trask! That's great! I've just tested it with the latest snapshot agent and, so far, haven't encountered broken traces 👍
I'll do some further testing tomorrow and report back.

@flangknecht (Author) commented

This seems to have been fixed. I've rebuilt and tested with the recently released 0.16.0 on my private K3s cluster, on Minikube, and on my company's Kubernetes cluster, and have not encountered any broken spans since.

Thanks for your work!
