Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple root spans and orphaned spans being combined into a single trace #3776

Open
codedinsugar opened this issue Jun 14, 2024 · 4 comments

Comments

@codedinsugar
Copy link

Describe the bug
When viewing traces in Grafana Tempo v2.3.1 microservices deployment, a small percentage of our traces are exhibiting the strange behavior of:

  • Two requests to the same microservice endpoint are shown in the same trace (blue rectangles in screenshot)
  • Two requests from two different hosts to the same microservice endpoint are shown in the same trace (not shown in screenshot but similar to first bullet point)
  • Orphaned child spans are shown outside of their parent span and outside of the root span (pink rectangle in screenshot)

For traces showing two requests, the trace view shows a gap between the markers under the Service & Operation table. Meaning the trace has a duration of 7 seconds but the duration of each request is under 5ms. There is a 7 second gap between the first request and the second request. See white line in screenshot.

grafana-tempo-weird-traces-problem

To Reproduce
Steps to reproduce the behavior:

  1. Manually instrument a Java application with OTel framework 1.37.0
  2. Send 50k-100k requests to the microservice
  3. Observe traces in Grafana Tempo

Expected behavior
Only a single request should be shown in each trace

Environment:

  • All services are running in AWS EKS
  • AWS s3 is the storage backend for Grafana Tempo

Additional Context

  • The telemetry pipeline is microservice -> otel-collector -> Grafana Tempo
  • The problem is not happening on all traces
@mapno
Copy link
Member

mapno commented Jun 18, 2024

Hi @codedinsugar. At first glance, this looks like an issue with the instrumentation—ie. the same "context" is being used for both calls. Can you verify that you're independently creating new traces for each call without reusing something that might carry the context of previous traces? Or, would it be possible to have a reproducible setup that generates those traces? Manually instrument a Java application with OTel framework 1.37.0 is too generic.

@joe-elliott
Copy link
Member

Other things to consider:

  • Time clock skew may be the cause of the 8s gap
  • There is also the possibility of trace id collision. Check out the log_received_spans config section on the distributor. Perhaps we could use this to diagnose.

@codedinsugar
Copy link
Author

@mapno that is something that we're considering. We've recently refactored our app along these lines but I think the tracer instantiation might still be an issue for us. One thing we're trying is to add traceId as a span attribute with Span.current().getSpanContext().getTraceId() but the value is always all 0's and we're not sure why. Any thoughts here?

We'd like to provide a reproducible setup but this is a proprietary monolith and cannot be shared. We might be able to build a smaller sanitized version but that'll take time.

@joe-elliott thanks for the suggestion and we'll consider it, our only concern is the "This is not recommended for production environments" statement.

@joe-elliott
Copy link
Member

This is not recommended for production environments

I would definitely not leave it on permanently, but for a short time period it maybe helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants