Tail Sampling Processor is using async spans to count the latency duration #24912

Closed
tonidas opened this issue Aug 4, 2023 · 10 comments
Labels
bug (Something isn't working) · processor/tailsampling (Tail sampling processor)

Comments

@tonidas

tonidas commented Aug 4, 2023

Component(s)

processor/tailsampling

What happened?

Description

The Tail Sampling Processor can sample more traces than it should if there is async activity at the end of a request trace.

Steps to Reproduce

Send a request trace that asynchronously produces to a Kafka topic at the end of the request (the produce span finishes after the response has already been sent). Let's assume the request took 5ms of execution time and the Kafka production took 500ms.

Expected Result

If the Tail Sampling Processor is configured with a latency threshold of 10ms, I expect this trace not to be sampled.

Actual Result

If the Tail Sampling Processor is configured with a latency threshold of 10ms, this trace gets sampled.
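
For illustration only (this is not the processor's actual code), a minimal Go sketch of the behavior described above, assuming the latency policy measures a trace as the window from the earliest span start to the latest span end; with a 5ms request span plus a 500ms async producer span, the measured duration is roughly 505ms, which exceeds the 10ms threshold:

// Minimal sketch, assuming the latency policy derives a trace's duration
// from the earliest span start to the latest span end. A 5ms request span
// plus a 500ms async Kafka producer span yields ~505ms, which exceeds a
// 10ms threshold, so the trace is sampled. Illustrative only.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func traceDuration(td ptrace.Traces) time.Duration {
	var minStart, maxEnd pcommon.Timestamp
	first := true
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		sss := rss.At(i).ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				s := spans.At(k)
				if first || s.StartTimestamp() < minStart {
					minStart = s.StartTimestamp()
				}
				if first || s.EndTimestamp() > maxEnd {
					maxEnd = s.EndTimestamp()
				}
				first = false
			}
		}
	}
	if first {
		return 0 // no spans in the trace
	}
	return maxEnd.AsTime().Sub(minStart.AsTime())
}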

Collector version

v0.8.x

Environment information

Environment

OS: Amazon Linux 2
Compiler(if manually compiled): N/A

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 7000
    spike_limit_mib: 1400
  batch:
    send_batch_size: 2000
    timeout: 1s
  tail_sampling:
    decision_wait: 30s
    num_traces: 2000000
    policies:
      [
        {
          name: main-policy,
          type: latency,
          latency: { threshold_ms: 80 }
        }
      ]

exporters:
  otlp/elastic:
    endpoint: "${ELASTICSEARCH_APM_HOST}"
    headers:
      Authorization: "Bearer ${ELASTICSEARCH_APM_TOKEN}"
    sending_queue:
      enabled: true
      queue_size: 1000000
      num_consumers: 20
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 10m

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, tail_sampling, batch ]
      exporters: [ otlp/elastic ]

Log output

No response

Additional context

Right now I'm running a patched otel-collector-contrib that ignores Kafka producer spans in the Tail Sampling Processor's latency duration calculation, but the patch isn't generic enough to be turned into a pull request. Is there a broad way of detecting async spans?

I think the best approach is to provide a configuration option so the user can choose between the two behaviors (counting or ignoring async spans), with ignoring them as the default. The rationale is that it's much more straightforward to think of latency as the duration of the request's lifetime.
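
To make that concrete, here is a hedged sketch of the kind of filtering such a patch could do (the isAsync helper and any exclude_async_spans option are hypothetical, not existing processor code or configuration): skip producer-kind spans before computing the latency window.

// Hedged sketch of the workaround idea: treat OTLP producer-kind spans
// (e.g. Kafka publishes) as "async" and exclude them from the latency
// window. A config option to toggle this (say, exclude_async_spans) is
// hypothetical; no such option exists in the processor today.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// isAsync marks a span as async work. SpanKindProducer is the broadest
// built-in hint; a real implementation might also inspect span links or
// messaging.* attributes.
func isAsync(s ptrace.Span) bool {
	return s.Kind() == ptrace.SpanKindProducer
}

// requestDuration computes the earliest-start/latest-end window over the
// non-async spans only.
func requestDuration(td ptrace.Traces) time.Duration {
	var minStart, maxEnd pcommon.Timestamp
	seen := false
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		sss := rss.At(i).ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				s := spans.At(k)
				if isAsync(s) {
					continue // skip async producer spans
				}
				if !seen || s.StartTimestamp() < minStart {
					minStart = s.StartTimestamp()
				}
				if !seen || s.EndTimestamp() > maxEnd {
					maxEnd = s.EndTimestamp()
				}
				seen = true
			}
		}
	}
	if !seen {
		return 0 // every span was async, or the trace was empty
	}
	return maxEnd.AsTime().Sub(minStart.AsTime())
}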

tonidas added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Aug 4, 2023
github-actions bot added the processor/tailsampling (Tail sampling processor) label on Aug 4, 2023
@github-actions
Contributor

github-actions bot commented Aug 4, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

To clarify: the trace took 505ms, where 5ms was processing time and 500ms was waiting time. Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and you expect it not to. Is that correct?

@tonidas
Author

tonidas commented Aug 4, 2023

@jpkrohling
The 500ms is not waiting time; it's spent asynchronously producing an event to a Kafka topic.
After the first 5ms the client has already received the response, but the application creates a thread/goroutine to publish the event to Kafka.
Answering your question, yes: Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and I expect it not to.

Greetings from a fellow Brazilian! Thanks for being the code owner of the tail sampling processor.

@jpkrohling
Member

Answering your question, yes: Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and I expect it not to.

Got it. I suppose there are two ways of interpreting that: one is the total amount of time as experienced by the user, taking the window from the root span's start timestamp to the latest timestamp+duration across all spans (which is the current algorithm, as far as I remember); the other is the sum of the span durations minus the overlap time between them. It looks like you want the second.

I'd be curious to see a PR for this. My feeling without attempting it is that it would take quite some cycles to compute an accurate duration for that.
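
For reference, a minimal sketch of that second interpretation (merging overlapping span intervals so shared time is counted only once), using plain intervals rather than collector types; it only illustrates the idea, not processor code:

// Minimal sketch of "sum of durations minus overlap": merge overlapping
// span intervals and sum the merged lengths, so concurrent time is counted
// once. Plain time.Time intervals stand in for spans; illustrative only.
package sketch

import (
	"sort"
	"time"
)

type interval struct{ start, end time.Time }

func busyDuration(spans []interval) time.Duration {
	if len(spans) == 0 {
		return 0
	}
	// Sort by start time, then sweep and merge overlapping intervals.
	sorted := append([]interval(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start.Before(sorted[j].start) })

	var total time.Duration
	cur := sorted[0]
	for _, s := range sorted[1:] {
		switch {
		case s.start.After(cur.end):
			// Gap: close out the current merged interval.
			total += cur.end.Sub(cur.start)
			cur = s
		case s.end.After(cur.end):
			// Overlap that extends the current interval.
			cur.end = s.end
		}
	}
	return total + cur.end.Sub(cur.start)
}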

Greetings from a fellow Brazilian!

\o/

@tonidas
Author

tonidas commented Aug 28, 2023

Hi @jpkrohling,
Actually, today's implementation considers the whole execution, not only what is experienced by the user.

In my initial message, I describe the use case of an async call at the end of request processing. That async call can finish long after the client has received the response!

@jpkrohling
Member

Actually, today's implementation considers the whole execution, not only what is experienced by the user.

You are absolutely right. Those are typically the same for sync processes, but likely different when async parts are involved.

Anyway, I'd be interested in reviewing a PR for that.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Nov 13, 2023
@jpkrohling
Member

@tonidas, are you still interested in working on this?

@tonidas
Author

tonidas commented Nov 27, 2023

@jpkrohling thanks for the opportunity, but I don't think I'll have time to implement this for a while.

github-actions bot removed the Stale label on Nov 28, 2023
@jpkrohling
Member

That's alright. I'm closing this for now, but if you (or anyone else) is interested in contributing code for this feature, I can reopen.
