Multiple root spans and orphaned spans being combined into a single trace #3776

codedinsugar · 2024-06-14T20:04:53Z

Describe the bug
When viewing traces in Grafana Tempo v2.3.1 microservices deployment, a small percentage of our traces are exhibiting the strange behavior of:

Two requests to the same microservice endpoint are shown in the same trace (blue rectangles in screenshot)
Two requests from two different hosts to the same microservice endpoint are shown in the same trace (not shown in screenshot but similar to first bullet point)
Orphaned child spans are shown outside of their parent span and outside of the root span (pink rectangle in screenshot)

For traces showing two requests, the trace view shows a gap between the markers under the Service & Operation table. Meaning the trace has a duration of 7 seconds but the duration of each request is under 5ms. There is a 7 second gap between the first request and the second request. See white line in screenshot.

To Reproduce
Steps to reproduce the behavior:

Manually instrument a Java application with OTel framework 1.37.0
Send 50k-100k requests to the microservice
Observe traces in Grafana Tempo

Expected behavior
Only a single request should be shown in each trace

Environment:

All services are running in AWS EKS
AWS s3 is the storage backend for Grafana Tempo

Additional Context

The telemetry pipeline is microservice -> otel-collector -> Grafana Tempo
The problem is not happening on all traces

mapno · 2024-06-18T15:12:29Z

Hi @codedinsugar. At first glance, this looks like an issue with the instrumentation—ie. the same "context" is being used for both calls. Can you verify that you're independently creating new traces for each call without reusing something that might carry the context of previous traces? Or, would it be possible to have a reproducible setup that generates those traces? Manually instrument a Java application with OTel framework 1.37.0 is too generic.

joe-elliott · 2024-06-18T15:25:23Z

Other things to consider:

Time clock skew may be the cause of the 8s gap
There is also the possibility of trace id collision. Check out the log_received_spans config section on the distributor. Perhaps we could use this to diagnose.

codedinsugar · 2024-06-20T12:37:35Z

@mapno that is something that we're considering. We've recently refactored our app along these lines but I think the tracer instantiation might still be an issue for us. One thing we're trying is to add traceId as a span attribute with Span.current().getSpanContext().getTraceId() but the value is always all 0's and we're not sure why. Any thoughts here?

We'd like to provide a reproducible setup but this is a proprietary monolith and cannot be shared. We might be able to build a smaller sanitized version but that'll take time.

@joe-elliott thanks for the suggestion and we'll consider it, our only concern is the "This is not recommended for production environments" statement.

joe-elliott · 2024-06-20T14:32:19Z

This is not recommended for production environments

I would definitely not leave it on permanently, but for a short time period it maybe helpful.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple root spans and orphaned spans being combined into a single trace #3776

Multiple root spans and orphaned spans being combined into a single trace #3776

codedinsugar commented Jun 14, 2024

mapno commented Jun 18, 2024

joe-elliott commented Jun 18, 2024

codedinsugar commented Jun 20, 2024

joe-elliott commented Jun 20, 2024

Multiple root spans and orphaned spans being combined into a single trace #3776

Multiple root spans and orphaned spans being combined into a single trace #3776

Comments

codedinsugar commented Jun 14, 2024

mapno commented Jun 18, 2024

joe-elliott commented Jun 18, 2024

codedinsugar commented Jun 20, 2024

joe-elliott commented Jun 20, 2024