192 changes: 192 additions & 0 deletions observability/otel/otel-collector/otel-collector-processors.adoc
@@ -22,6 +22,7 @@ Currently, the following General Availability and Technology Preview processors
- xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#cumulativetodelta-processor_otel-collector-processors[Cumulative-to-Delta Processor]
- xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#groupbyattrsprocessor-processor_otel-collector-processors[Group-by-Attributes Processor]
- xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#transform-processor_otel-collector-processors[Transform Processor]
- xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#tail-sampling-processor_otel-collector-processors[Tail Sampling Processor]

[id="batch-processor_{context}"]
== Batch Processor
@@ -578,3 +579,194 @@ config:
[id="additional-resources_{context}"]
== Additional resources
* link:https://opentelemetry.io/docs/specs/otlp/[OpenTelemetry Protocol (OTLP)] (OpenTelemetry Documentation)

[id="tail-sampling-processor_{context}"]
== Tail Sampling Processor

The Tail Sampling Processor samples traces based on a set of defined policies.
All spans for a given trace must be received by the same collector instance for effective sampling decisions.

You must place this processor in pipelines after any processors that rely on context, for example the `k8sattributes` processor, because the Tail Sampling Processor reassembles spans into new batches and the spans lose their original context.

:FeatureName: The Tail Sampling Processor
include::snippets/technology-preview.adoc[]
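
The following minimal sketch illustrates this placement: the `tail_sampling` processor runs after the `k8sattributes` processor in a traces pipeline. The `otlp` receiver and exporter entries and the single example policy are assumptions for illustration, not part of the reference configuration, and must be defined to match your deployment.

.Pipeline ordering sketch
[source,yaml]
----
# ...
config:
  processors:
    k8sattributes: {}
    tail_sampling:
      decision_wait: 30s
      policies:
        [
          {
            name: errors-only-policy,
            type: status_code,
            status_code: {status_codes: [ERROR]}
          }
        ]
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [k8sattributes, tail_sampling] # context-dependent processors run before tail_sampling
        exporters: [otlp]
# ...
----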

.Configuration summary
[source,yaml]
----
# ...
config:
  processors:
    tail_sampling:
      decision_wait: 30s # <1>
      num_traces: 50000 # <2>
      expected_new_traces_per_sec: 10 # <3>
      policies:
        [
          {
            name: always-sample-policy,
            type: always_sample # <4>
          },
          {
            name: latency-policy,
            type: latency, # <5>
            latency: {threshold_ms: 5000, upper_threshold_ms: 10000}
          },
          {
            name: numeric-attribute-policy,
            type: numeric_attribute, # <6>
            numeric_attribute: {key: key1, min_value: 50, max_value: 100}
          },
          {
            name: probabilistic-policy,
            type: probabilistic, # <7>
            probabilistic: {sampling_percentage: 10}
          },
          {
            name: status-code-policy,
            type: status_code, # <8>
            status_code: {status_codes: [ERROR, UNSET]}
          },
          {
            name: string-attribute-policy,
            type: string_attribute, # <9>
            string_attribute: {key: key2, values: [value1, val*], enabled_regex_matching: true, cache_max_size: 10}
          },
          {
            name: rate-limiting-policy,
            type: rate_limiting, # <10>
            rate_limiting: {spans_per_second: 35}
          },
          {
            name: span-count-policy,
            type: span_count, # <11>
            span_count: {min_spans: 2, max_spans: 20}
          },
          {
            name: trace-state-policy,
            type: trace_state, # <12>
            trace_state: { key: key3, values: [value1, value2] }
          },
          {
            name: bool-attribute-policy,
            type: boolean_attribute, # <13>
            boolean_attribute: {key: key4, value: true}
          },
          {
            name: ottl-policy,
            type: ottl_condition, # <14>
            ottl_condition: {
              error_mode: ignore,
              span: [
                "attributes[\"test_attr_key_1\"] == \"test_attr_val_1\"",
                "attributes[\"test_attr_key_2\"] != \"test_attr_val_1\"",
              ],
              spanevent: [
                "name != \"test_span_event_name\"",
                "attributes[\"test_event_attr_key_2\"] != \"test_event_attr_val_1\"",
              ]
            }
          },
          {
            name: and-policy,
            type: and, # <15>
            and: {
              and_sub_policy:
                [
                  {
                    name: and-policy-1,
                    type: numeric_attribute,
                    numeric_attribute: { key: key1, min_value: 50, max_value: 100 }
                  },
                  {
                    name: and-policy-2,
                    type: string_attribute,
                    string_attribute: { key: key2, values: [ value1, value2 ] }
                  },
                ]
            }
          },
          {
            name: drop-policy,
            type: drop, # <16>
            drop: {
              drop_sub_policy:
                [
                  {
                    name: drop-policy-1,
                    type: string_attribute,
                    string_attribute: {key: url.path, values: [\/health, \/metrics], enabled_regex_matching: true}
                  }
                ]
            }
          },
          {
            name: composite-policy,
            type: composite, # <17>
            composite:
              {
                max_total_spans_per_second: 1000,
                policy_order: [test-composite-policy-1, test-composite-policy-2, test-composite-policy-3],
                composite_sub_policy:
                  [
                    {
                      name: composite-policy-1,
                      type: numeric_attribute,
                      numeric_attribute: {key: key1, min_value: 50}
                    },
                    {
                      name: composite-policy-2,
                      type: string_attribute,
                      string_attribute: {key: key2, values: [value1, value2]}
                    },
                    {
                      name: composite-policy-3,
                      type: always_sample
                    }
                  ],
                rate_allocation:
                  [
                    {
                      policy: composite-rate-policy-1,
                      percent: 50
                    },
                    {
                      policy: composite-rate-policy-2,
                      percent: 25
                    }
                  ]
              }
          },
        ]
# ...
----
<1> Wait time since the first span of a trace before making a sampling decision. Default is 30s.
<2> Number of traces kept in memory. Default is 50000.
<3> Expected number of new traces (helps in allocating data structures). Default is 0.
<4> Policy to sample all traces.
<5> Policy to sample based on the duration of a trace. The duration is calculated from the earliest start time and the latest end time, without considering what happened in between. If you do not supply an upper bound, the policy samples anything longer than `threshold_ms`.
<6> Policy to sample based on numeric attributes (resource and record).
<7> Policy to sample a percentage of traces.
<8> Policy to sample based on the status code (OK, ERROR, or UNSET).
<9> Policy to sample based on string attribute (resource and record) value matches. Both exact and regular expression value matches are supported.

<10> Policy to sample based on the rate of spans per second.
<11> Policy to sample based on the minimum and maximum number of spans, inclusive. If the total number of spans in the trace is outside the range threshold, the trace is not sampled.
<12> Policy to sample based on TraceState value matches.
<13> Policy to sample based on a Boolean attribute (resource and record).
<14> Policy to sample based on a given OTTL Boolean condition (span and span event).
<15> Policy to sample based on multiple sub-policies. Creates an AND policy.
<16> Policy to drop, that is not sample, traces based on multiple sub-policies. Creates a DROP policy.
<17> Policy to sample based on a combination of the preceding samplers, with ordering and rate allocation per sampler. Rate allocation assigns a percentage of spans to each policy in the policy order. For example, if `max_total_spans_per_second` is set to `100`, you can set `rate_allocation` so that `test-composite-policy-1` receives 50 percent of the maximum, that is 50 spans per second, and `test-composite-policy-2` receives 25 percent, that is 25 spans per second. To fill the remaining capacity, use `always_sample` as one of the policies.


Each policy results in a decision, and the processor evaluates the decisions to make a final decision:

* When there is a "drop" decision, the trace is not sampled.
* When there is a "sample" decision, the trace is sampled.
* In all other cases, the trace is not sampled.
Comment on lines +762 to +766

max-cx (Contributor), May 15, 2025:
@pavolloffay, are decisions to drop or sample individual traces in any way visible to the user (for example, in a log or UI), or is this processor, loosely speaking, a black-box system? I see that there is some statistical tracking available, but I'm wondering about inspecting specific traces.

Member Author:
There are metrics exposed by the processor that can help you understand its behavior: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor#tracking-sampling-policy
It might be good to document them. Do you want me to update the PR, or do you prefer to do it yourself?

Contributor:
Thank you. Let me try. If it is too difficult for me, I will ask for your help.


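As a practical starting point, the following minimal sketch combines two of the policies from the configuration summary so that, roughly, all error traces plus a 10 percent share of the remaining traces are sampled. The policy names are placeholders chosen for this example.

[source,yaml]
----
# ...
config:
  processors:
    tail_sampling:
      decision_wait: 30s
      policies:
        [
          {
            name: keep-errors-policy,
            type: status_code,
            status_code: {status_codes: [ERROR]}
          },
          {
            name: keep-10-percent-policy,
            type: probabilistic,
            probabilistic: {sampling_percentage: 10}
          }
        ]
# ...
----
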
=== Scaling collectors with the tail sampling processor

This processor requires all spans for a given trace to be sent to the same collector instance for the correct sampling decision to be derived.
When scaling the collector, you must ensure that all spans for the same trace reach the same collector instance.
You can achieve this by having two layers of collectors: one with the load balancing exporter, and one with the tail sampling processor.
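
A minimal sketch of such a two-layer setup follows. The headless service host name, the namespace, the `otlp` receiver and exporter entries, and the example policy are assumptions for illustration and must be adapted to your deployment.

.First-layer collector with the load-balancing exporter (sketch)
[source,yaml]
----
# ...
config:
  exporters:
    loadbalancing:
      routing_key: traceID # route all spans of a trace to the same second-layer collector
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: tail-sampling-collector-headless.observability.svc.cluster.local
  service:
    pipelines:
      traces:
        receivers: [otlp]
        exporters: [loadbalancing]
# ...
----

.Second-layer collector with the tail sampling processor (sketch)
[source,yaml]
----
# ...
config:
  processors:
    tail_sampling:
      decision_wait: 30s
      policies:
        [
          {
            name: slow-traces-policy,
            type: latency,
            latency: {threshold_ms: 5000}
          }
        ]
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [tail_sampling]
        exporters: [otlp]
# ...
----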