Problem
The trace sampling rate is a static float (default 1.0). On a busy node, a service doing 10k req/s generates 10k spans/s regardless of whether they're interesting.
Proposed Behavior
- Always sample: Error responses (5xx, connection failures), slow requests (> P95)
- Normal sampling: Success responses at configured rate
- Reduced sampling: Health checks, readiness probes (if not already filtered)
This ensures operators always see the interesting traces while reducing volume for the common case.
Implementation
Add a SmartSampler that wraps the existing shouldSample():
    func shouldSample(status int, duration time.Duration) bool {
        if status >= 500 || duration > slowThreshold {
            return true // always sample errors and slow requests
        }
        return rand.Float64() < samplingRate
    }
The slow threshold could be dynamically computed from a rolling P95 or configured via flag.
Impact
Reduces OTel collector load by cutting span volume for routine successful requests, while traces for errors and slow requests are still always captured.