
Feature to split tailsampling into two phases pre and post sample #30319

Closed
tiithansen opened this issue Jan 6, 2024 · 14 comments
Labels
closed as inactive · connector/spanmetrics · enhancement (New feature or request) · needs triage (New item requiring triage) · processor/tailsampling (Tail sampling processor) · Stale

Comments

@tiithansen

Component(s)

connector/spanmetrics, processor/tailsampling

Is your feature request related to a problem? Please describe.

When generating span metrics with exemplars, it is not possible to specify which traces are going to be stored by the tail sampling processor and which are not. Currently, every trace passing through the spanmetrics connector gets exemplars recorded for it, even if the trace is not actually stored. This causes frustration for developers trying to debug spikes on graphs.

Describe the solution you'd like

Split tail sampling into two phases:

  • Presample phase runs the policies to decide which traces should be stored. If a trace should be stored, a new attribute tail_sampling.sample: true is attached to it. The presample phase passes all traces on to the next stages.
  • Postsample phase drops all traces which do not have tail_sampling.sample: true set.

Between these two stages it is possible to run the spanmetrics connector, which would check the tail_sampling.sample attribute to determine whether it should export exemplars.

Sample pipeline:

processors:
  tail_sampling/presample:
    processor_mode: presample
  tail_sampling/postsample:
    processor_mode: postsample

service:
  telemetry:
    logs:
      encoding: json
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - ...
      exporters:
        - loadbalancing
    traces/after_lb:
      receivers:
        - otlp/lb
      processors:
        - tail_sampling/presample
      exporters:
        - forward
        - spanmetrics
    traces/sample:
      receivers:
        - forward
      processors:
        - tail_sampling/postsample
        - batch
      exporters:
        - otlphttp
    metrics:
      receivers:
        - spanmetrics
      processors:
        - attributes/metrics
        - batch/metrics
      exporters:
        - otlphttp/metrics
    logs: null

Describe alternatives you've considered

No response

Additional context

No response

@tiithansen added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Jan 6, 2024
Contributor

github-actions bot commented Jan 6, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jiekun
Member

jiekun commented Jan 8, 2024

Correct me if I am wrong: the first tail sampling processor will not sample any traces, but will tag them as sampled / not sampled, so that the span metrics connector can attach exemplars of the sampled traces to metrics, right?

I understand that it would benefit the metrics, but I feel the tail sampling processor should not be further enhanced to support it. And it would take roughly 2x CPU / memory (correct me if I'm wrong) to handle.

What if we just added some policies to the span metrics connector to spot important traces?

@tiithansen
Author

Correct me if I am wrong: the first tail sampling processor will not sample any traces, but will tag them as sampled / not sampled, so that the span metrics connector can attach exemplars of the sampled traces to metrics, right?

The pre-sample phase will do the actual policy execution, which determines and tags the sampled traces.

I understand that it would benefit the metrics, but I feel the tail sampling processor should not be further enhanced to support it. And it would take roughly 2x CPU / memory (correct me if I'm wrong) to handle.

It will increase CPU usage, but not by 2x, because the post-sample phase will not run any policies. It just iterates over the spans in a trace and drops or keeps them depending on whether the trace is tagged.

What if we just added some policies to the span metrics connector to spot important traces?

This approach could increase CPU usage by 2x, because for it to make sense we would need to duplicate the same logic in the metrics connector. Worst case: duplicated code and 2x CPU usage. Plus, there is a chance that the two components would not come to the same conclusion about whether a span should be kept or dropped.

@jiekun
Member

jiekun commented Jan 8, 2024

Thank you @tiithansen. So the pre-sample phase is actually a dry-run phase, right?

// The following comments are just my personal point of view. We need maintainers' support here.

  1. In order to isolate the responsibilities of different components, I think a dry-run mode (or whatever name fits) could be added to the tail-sampling processor to tag traces as "sampled" or "not sampled."
  2. This data would then be processed by the span metrics connector.
  3. Finally, we could filter out traces with the "not sampled" tag directly using existing components, without needing the tail-sampling processor for that step (I don't think dropping traces based on a tag is the responsibility of the tail-sampling processor); see the sketch at the end of this comment.

So all we would need to add to the tail-sampling processor (if this proposal is accepted) is a dry-run config option, keeping most things working as before.

But this is just my thought, and I would like more input from both the issue author and the maintainers. Thanks for your contribution again, @tiithansen.
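
For illustration, step 3 could look something like the following, assuming the tagging phase sets a tail_sampling.sample span attribute as proposed above. The filter processor and its OTTL conditions are existing components; only the attribute name comes from this proposal and is not something any component sets today:

processors:
  filter/drop-unsampled:
    error_mode: ignore
    traces:
      span:
        # Drop every span whose trace was not tagged by the dry-run tail sampling phase.
        - 'attributes["tail_sampling.sample"] != true'

This keeps the tail sampling processor focused on making decisions and leaves the dropping to a generic, stateless component.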

@tiithansen
Author

Yes, it pretty much is a dry run in that sense. Using some other, more lightweight processor to do the actual dropping might make sense, because tail sampling has a lot going on inside; it batches traces together, etc. But for the sake of simplicity I implemented it in tail sampling for now.

There is a PR open with an initial implementation already. Feel free to have a look at how it is implemented.

@portertech
Contributor

portertech commented Jan 10, 2024

I would love to know which policy sampled the trace (in the presample phase). Perhaps the policy name would make a good default attribute value? The attribute's existence implies true, and the value provides useful context. Perhaps prefix it with the policy type, e.g. probabilistic/ten-percent.

@portertech
Contributor

I agree with @jiekun: we should use existing components to filter out traces without the sample attribute, focusing the implementation on the dry run. I'm not yet sure how I feel about "presample" and "dry-run"; I suspect there's a better name for this mode of operation 🤔

@tiithansen
Author

Adding the name of the policy which triggered sampling should be fairly easy. But for debugging it would make sense if we could put the whole decision tree into the attribute value. For example, if we have nested and policies, seeing just the root and the policy name does not add any value; we should see something like root.nested_and.latency/1s.

Regarding "presample" and "dry-run": maybe we could just have two types of behavior selected by processor_mode: decide_and_filter, which is the current behavior and would remain the default, and decide_and_tag, which would be the new one.
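
As a purely hypothetical sketch of that naming, using the processor's existing policies syntax (processor_mode and both of its values are only the proposal from this comment, not options the tail sampling processor has today; the policy names are placeholders):

processors:
  tail_sampling/tag:
    # Proposed: evaluate policies but only tag matching traces with tail_sampling.sample.
    processor_mode: decide_and_tag
    policies:
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
  tail_sampling/classic:
    # Proposed name for today's default behavior: drop traces that no policy samples.
    processor_mode: decide_and_filter
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]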

@portertech
Contributor

@tiithansen excellent point and great suggestions. Is there "prior art" of another component having this kind of mode switch? Wondering whether we start to establish a convention or go it alone.

@portertech
Contributor

portertech commented Jan 12, 2024

FYI Honeycomb Refinery refers to "dry run mode" https://github.com/honeycombio/refinery#dry-run-mode

@tiithansen
Author

To me this dry run sounds a bit different. Based on the info there, it looks more like a Helm dry run, where it checks whether the manifests/rules are correct and working as expected. In our case we are not evaluating the tail sampling policies; we are changing how tail sampling should handle unwanted traces: whether it should tag them or drop them.

I will update the pull request soon with the changes that have been discussed here.

@tiithansen
Author

@portertech @jiekun I have updated the PR. The tail sampling README also contains a full example of how to use tailsampling, filterprocessor, and spanmetrics together.
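
Not the README example itself, but roughly the shape such a setup takes; the tag mode and the tail_sampling.sample attribute are still only the proposed behavior, while the spanmetrics, forward, filter, and OTLP components all exist today (filter/drop-unsampled refers to the sketch in an earlier comment):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling/tag]            # proposed tag-only mode
      exporters: [spanmetrics, forward]          # spanmetrics can read the tag for exemplars
    traces/sampled:
      receivers: [forward]
      processors: [filter/drop-unsampled, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp/metrics]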

github-actions bot

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 13, 2024