Tail Sampling Processor is using async spans to count the latency duration #24912

Closed
tonidas opened this issue Aug 4, 2023 · 10 comments
Labels
bug (Something isn't working) · processor/tailsampling (Tail sampling processor)

Comments

@tonidas

tonidas commented Aug 4, 2023

Component(s)

processor/tailsampling

What happened?

Description

The Tail Sampling Processor can sample more traces than it should if there is async activity at the end of a request trace.

Steps to Reproduce

Send a request trace that asynchronously produces to a Kafka topic at the end of the request (the produce span finishes after the response has already been sent). Let's assume the request took 5ms of execution time and the Kafka production took 500ms.

Expected Result

If the Tail Sampling Processor is configured with a latency threshold of 10ms, I expect this trace not to be sampled.

Actual Result

If the Tail Sampling Processor is configured with a latency threshold of 10ms, this trace gets sampled.
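
For illustration only (this is not the processor's actual code), a minimal Go sketch of the behavior described above, assuming the latency policy measures a trace as the window from the earliest span start to the latest span end; with a 5ms request span plus a 500ms async producer span, the measured duration is roughly 505ms, which exceeds the 10ms threshold:

// Minimal sketch, assuming the latency policy derives a trace's duration
// from the earliest span start to the latest span end. A 5ms request span
// plus a 500ms async Kafka producer span yields ~505ms, which exceeds a
// 10ms threshold, so the trace is sampled. Illustrative only.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func traceDuration(td ptrace.Traces) time.Duration {
	var minStart, maxEnd pcommon.Timestamp
	first := true
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		sss := rss.At(i).ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				s := spans.At(k)
				if first || s.StartTimestamp() < minStart {
					minStart = s.StartTimestamp()
				}
				if first || s.EndTimestamp() > maxEnd {
					maxEnd = s.EndTimestamp()
				}
				first = false
			}
		}
	}
	if first {
		return 0 // no spans in the trace
	}
	return maxEnd.AsTime().Sub(minStart.AsTime())
}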

Collector version

v0.8.x

Environment information

Environment

OS: Amazon Linux 2
Compiler(if manually compiled): N/A

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 7000
    spike_limit_mib: 1400
  batch:
    send_batch_size: 2000
    timeout: 1s
  tail_sampling:
    decision_wait: 30s
    num_traces: 2000000
    policies:
      [
        {
          name: main-policy,
          type: latency,
          latency: { threshold_ms: 80 }
        }
      ]

exporters:
  otlp/elastic:
    endpoint: "${ELASTICSEARCH_APM_HOST}"
    headers:
      Authorization: "Bearer ${ELASTICSEARCH_APM_TOKEN}"
    sending_queue:
      enabled: true
      queue_size: 1000000
      num_consumers: 20
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 10m

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, tail_sampling, batch ]
      exporters: [ otlp/elastic ]

Log output

No response

Additional context

Right now I'm running a patched otel-collector-contrib that ignores Kafka producer spans in the Tail Sampling Processor's latency duration calculation, but the patch isn't generic enough to be turned into a pull request. Is there a broad way of detecting async spans?

I think the best approach is to provide a configuration option so the user can choose between the two behaviors (counting or ignoring async spans), with ignoring them as the default. The rationale is that it's much more straightforward to think of latency as the duration of the request's lifetime.
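
To make that concrete, here is a hedged sketch of the kind of filtering such a patch could do (the isAsync helper and any exclude_async_spans option are hypothetical, not existing processor code or configuration): skip producer-kind spans before computing the latency window.

// Hedged sketch of the workaround idea: treat OTLP producer-kind spans
// (e.g. Kafka publishes) as "async" and exclude them from the latency
// window. A config option to toggle this (say, exclude_async_spans) is
// hypothetical; no such option exists in the processor today.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// isAsync marks a span as async work. SpanKindProducer is the broadest
// built-in hint; a real implementation might also inspect span links or
// messaging.* attributes.
func isAsync(s ptrace.Span) bool {
	return s.Kind() == ptrace.SpanKindProducer
}

// requestDuration computes the earliest-start/latest-end window over the
// non-async spans only.
func requestDuration(td ptrace.Traces) time.Duration {
	var minStart, maxEnd pcommon.Timestamp
	seen := false
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		sss := rss.At(i).ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				s := spans.At(k)
				if isAsync(s) {
					continue // skip async producer spans
				}
				if !seen || s.StartTimestamp() < minStart {
					minStart = s.StartTimestamp()
				}
				if !seen || s.EndTimestamp() > maxEnd {
					maxEnd = s.EndTimestamp()
				}
				seen = true
			}
		}
	}
	if !seen {
		return 0 // every span was async, or the trace was empty
	}
	return maxEnd.AsTime().Sub(minStart.AsTime())
}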

tonidas added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Aug 4, 2023
github-actions bot added the processor/tailsampling (Tail sampling processor) label on Aug 4, 2023
@github-actions
Contributor

github-actions bot commented Aug 4, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

To clarify: the trace took 505ms, where 5ms was processing time and 500ms was waiting time. Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and you expect it not to. Is that correct?

@tonidas
Author

tonidas commented Aug 4, 2023

@jpkrohling
The 500ms is not waiting time; it's spent asynchronously producing an event to a Kafka topic.
After the first 5ms the client has already received the response, but the application creates a thread/goroutine to publish the event to Kafka.
Answering your question, yes: Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and I expect it not to.

Greetings from a fellow Brazilian! Thanks for being the code owner of the tail sampling processor.

@jpkrohling
Member

Answering your question, yes: Configuring the tail-sampling processor with a threshold of 10ms samples this trace, and I expect it not to.

Got it. I suppose there are two ways of interpreting that: one is the total amount of time as experienced by the user, taking the window from the root span's start timestamp to the latest timestamp+duration across all spans (which is the current algorithm, as far as I remember); the other is the sum of the span durations minus the overlap time between them. It looks like you want the second.

I'd be curious to see a PR for this. My feeling without attempting it is that it would take quite some cycles to compute an accurate duration for that.
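
For reference, a minimal sketch of that second interpretation (merging overlapping span intervals so shared time is counted only once), using plain intervals rather than collector types; it only illustrates the idea, not processor code:

// Minimal sketch of "sum of durations minus overlap": merge overlapping
// span intervals and sum the merged lengths, so concurrent time is counted
// once. Plain time.Time intervals stand in for spans; illustrative only.
package sketch

import (
	"sort"
	"time"
)

type interval struct{ start, end time.Time }

func busyDuration(spans []interval) time.Duration {
	if len(spans) == 0 {
		return 0
	}
	// Sort by start time, then sweep and merge overlapping intervals.
	sorted := append([]interval(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start.Before(sorted[j].start) })

	var total time.Duration
	cur := sorted[0]
	for _, s := range sorted[1:] {
		switch {
		case s.start.After(cur.end):
			// Gap: close out the current merged interval.
			total += cur.end.Sub(cur.start)
			cur = s
		case s.end.After(cur.end):
			// Overlap that extends the current interval.
			cur.end = s.end
		}
	}
	return total + cur.end.Sub(cur.start)
}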

Greetings from a fellow Brazilian!

\o/

@tonidas
Author

tonidas commented Aug 28, 2023

Hi @jpkrohling,
Actually, today's implementation considers the whole execution, not only what is experienced by the user.

In my initial message, I describe the use case of an async call at the end of request processing. That async call can finish long after the client has received the response!

@jpkrohling
Member

Actually, today's implementation considers the whole execution, not only what is experienced by the user.

You are absolutely right. Those are typically the same for sync processes, but likely different when async parts are involved.

Anyway, I'd be interested in reviewing a PR for that.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Nov 13, 2023
@jpkrohling
Member

@tonidas, are you still interested in working on this?

@tonidas
Author

tonidas commented Nov 27, 2023

@jpkrohling thanks for the opportunity, but I don't think I'll have time to implement this for a while.

github-actions bot removed the Stale label on Nov 28, 2023
@jpkrohling
Member

That's alright. I'm closing this for now, but if you (or anyone else) is interested in contributing code for this feature, I can reopen.
