feat: adaptive trace sampling based on latency and error status #212

@mayankpande88

Problem

The trace sampling rate is a static float (default 1.0). At that default, a service on a busy node handling 10k req/s emits 10k spans/s, whether or not those spans are interesting.

Proposed Behavior

  • Always sample: Error responses (5xx, connection failures) and slow requests (latency > P95)
  • Normal sampling: Success responses at configured rate
  • Reduced sampling: Health checks, readiness probes (if not already filtered)

This ensures operators always see the interesting traces while reducing volume for the common case.
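
As a rough sketch of how the three tiers above could be separated, assuming hypothetical probe paths (`/healthz`, `/readyz`), a `failed` flag standing in for connection failures, and a separate `reducedRate` knob, none of which exist in the current code:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
	"time"
)

// Tier names the three sampling classes from the proposal.
type Tier int

const (
	TierAlways  Tier = iota // errors and slow requests
	TierNormal              // ordinary successes
	TierReduced             // health checks, readiness probes
)

// classify assigns a request to a tier. The probe paths and the failed flag
// (standing in for a connection failure) are hypothetical.
func classify(path string, status int, failed bool, d, slow time.Duration) Tier {
	switch {
	case failed || status >= 500 || d > slow:
		return TierAlways
	case strings.HasPrefix(path, "/healthz") || strings.HasPrefix(path, "/readyz"):
		return TierReduced
	default:
		return TierNormal
	}
}

// sampleRate maps a tier to a probability; reducedRate is an assumed knob.
func sampleRate(t Tier, normalRate, reducedRate float64) float64 {
	switch t {
	case TierAlways:
		return 1.0
	case TierReduced:
		return reducedRate
	default:
		return normalRate
	}
}

func main() {
	slow := 250 * time.Millisecond
	t := classify("/healthz", 200, false, 2*time.Millisecond, slow)
	keep := rand.Float64() < sampleRate(t, 0.1, 0.01)
	fmt.Println(t == TierReduced, keep)
}
```

The error and latency check runs first on purpose, so a failing or slow health check would still be traced in full rather than falling into the reduced tier.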

Implementation

Add a SmartSampler that wraps the existing shouldSample():

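// shouldSample keeps every error and slow request and samples everything
// else at the configured rate. slowThreshold and samplingRate are assumed
// to be package-level settings; needs "math/rand" and "time".
func shouldSample(status int, duration time.Duration) bool {
    if status >= 500 || duration > slowThreshold {
        return true // always sample errors and slow requests
    }
    return rand.Float64() < samplingRate
}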
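
To make the "wraps" part concrete, here is one possible shape for the wrapper. This is only a sketch: the `SmartSampler` fields and the idea of passing the existing decision in as a `func() bool` are assumptions, not the current API.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// SmartSampler wraps a base sampling decision. The field names and the
// shape of the base function are assumptions for illustration.
type SmartSampler struct {
	Base          func() bool   // existing rate-based decision
	SlowThreshold time.Duration // requests slower than this are always kept
}

// ShouldSample keeps every error and slow request, and defers to Base otherwise.
func (s *SmartSampler) ShouldSample(status int, d time.Duration) bool {
	if status >= 500 || d > s.SlowThreshold {
		return true
	}
	return s.Base()
}

func main() {
	rate := 0.1
	s := &SmartSampler{
		Base:          func() bool { return rand.Float64() < rate },
		SlowThreshold: 200 * time.Millisecond,
	}
	fmt.Println(s.ShouldSample(503, 5*time.Millisecond))   // true: server error
	fmt.Println(s.ShouldSample(200, 900*time.Millisecond)) // true: slow request
}
```

Injecting the base decision keeps the existing rate logic untouched and makes the wrapper easy to test with a fixed `Base`.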

The slow threshold could be dynamically computed from a rolling P95 or configured via flag.
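
For the dynamic option, a minimal sketch of a rolling P95 over the last N request durations might look like the following. The `rollingP95` type, its window size, and the fallback value are all hypothetical; it assumes a window size greater than zero and uses the nearest-rank percentile.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// rollingP95 keeps the last N durations in a ring buffer and reports their
// 95th percentile. A hypothetical helper, not an existing type.
type rollingP95 struct {
	mu      sync.Mutex
	samples []time.Duration
	next    int
	full    bool
}

// newRollingP95 allocates a window of the given size (must be > 0).
func newRollingP95(size int) *rollingP95 {
	return &rollingP95{samples: make([]time.Duration, size)}
}

// Observe records one request duration, overwriting the oldest when full.
func (r *rollingP95) Observe(d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.samples[r.next] = d
	r.next = (r.next + 1) % len(r.samples)
	if r.next == 0 {
		r.full = true
	}
}

// Threshold returns the nearest-rank P95 of the window, or fallback while
// the window is still empty.
func (r *rollingP95) Threshold(fallback time.Duration) time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	n := r.next
	if r.full {
		n = len(r.samples)
	}
	if n == 0 {
		return fallback
	}
	sorted := append([]time.Duration(nil), r.samples[:n]...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := (n*95+99)/100 - 1 // ceil(0.95*n) - 1
	return sorted[idx]
}

func main() {
	r := newRollingP95(100) // window size is an arbitrary choice
	for i := 1; i <= 100; i++ {
		r.Observe(time.Duration(i) * time.Millisecond)
	}
	fmt.Println(r.Threshold(time.Second)) // 95ms
}
```

Sorting a copy on every call is simple but costs O(n log n); on a hot path a histogram or a cached threshold refreshed on a timer would likely be cheaper.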

Impact

Reduces load on the OTel collector while still capturing every error and slow-request trace.
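
A back-of-the-envelope estimate of the resulting volume, assuming one span per request. Only the 10k req/s figure comes from this issue; the 6% always-sampled share and the 0.1 rate are made-up inputs, and `expectedSpansPerSec` is a hypothetical helper.

```go
package main

import "fmt"

// expectedSpansPerSec estimates sampled volume: the always-sampled fraction
// is kept in full, the remainder at the configured rate.
func expectedSpansPerSec(reqPerSec, alwaysFrac, rate float64) float64 {
	return reqPerSec * (alwaysFrac + (1-alwaysFrac)*rate)
}

func main() {
	// 10k req/s from the issue; 6% always-sampled and a 0.1 rate are made-up inputs.
	fmt.Printf("%.0f spans/s\n", expectedSpansPerSec(10000, 0.06, 0.1)) // 1540 spans/s
}
```

Note that if the slow threshold tracks a rolling P95, roughly 5% of requests exceed it by definition, so the always-sampled share has a floor near 5% no matter how low the configured rate is set.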
