When we work with telemetry in production environments, we need to balance the cost vs. observability trade-off. Send everything to your tracing backend, and watch your AWS bill explode. Sample too aggressively, and you’ll miss critical errors when you need them most.
The solution? Adaptive sampling with OpenTelemetry and AWS X-Ray. The idea is straightforward: capture 100% of errors (because those are the traces you actually need for debugging) while sampling normal operations at a configurable rate. This approach keeps costs manageable without sacrificing error visibility.
Most distributed tracing implementations force you to choose:
- Sample everything: Perfect visibility, impractical costs in production
- Fixed sampling rate: Lower costs, but you might miss critical errors
- Complex sampling rules: Hard to maintain, easy to misconfigure
I wanted something simpler: intelligent sampling that automatically captures what matters.
The setup uses three key components:
```mermaid
graph TD
    App["Python App<br/>(Flask in this example)"]
    Collector["OTEL Collector<br/>(Sidecar)"]
    XRay["AWS X-Ray<br/>Service"]

    App -->|OTLP/HTTP| Collector
    Collector -->|AWS X-Ray API| XRay
```
Why this architecture?
- OTLP Collector as middleware: Keeps AWS credentials out of application code
- OpenTelemetry SDK: Vendor-neutral instrumentation, can switch backends later
- AWS X-Ray backend: Mature tracing service with good visualization and integration
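The first two points are worth making concrete: the application only speaks OTLP to the local collector, and everything else is standard OpenTelemetry SDK wiring. Here is a minimal sketch of what a helper like setup_telemetry() sets up under the hood (the endpoint and service name are assumptions taken from the configuration shown later in this post):

```python
# Rough sketch of the SDK wiring performed by a setup helper; not the library's actual code
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "my-api"}))
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
# The post swaps this stock BatchSpanProcessor for the error-aware one shown next
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```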
The magic happens in a custom BatchSpanProcessor that inspects span status before export:
```python
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import StatusCode


class ErrorAwareBatchSpanProcessor(BatchSpanProcessor):
    """
    Span processor with adaptive sampling:
    - Always exports spans with errors (100% coverage)
    - Samples other spans based on the configured ratio
    """

    def __init__(self, exporter, sampling_rate: float = 0.05, **kwargs):
        super().__init__(exporter, **kwargs)
        self.sampling_rate = sampling_rate

    def on_end(self, span: ReadableSpan) -> None:
        # Always export spans with errors
        if span.status.status_code == StatusCode.ERROR:
            super().on_end(span)
            return

        # For non-errors, apply the sampling rate using trace_id for consistency
        trace_id = span.get_span_context().trace_id
        if (trace_id % 100) < int(self.sampling_rate * 100):
            super().on_end(span)
```

The key insight: use trace_id % 100 for deterministic sampling. This ensures that if one span in a distributed trace is sampled, related spans across services will also be sampled (assuming the same trace_id is propagated).
Initialization follows the principle of explicit configuration over implicit behavior:
```python
from core.telemetry import setup_telemetry
from opentelemetry.instrumentation.threading import ThreadingInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

setup_telemetry(
    environment="production",
    service_name="my-api",
    xray_enabled=True,
    otlp_endpoint="http://localhost:4318",
    sampling_rate=0.05,  # 5% of normal operations
    instrumentors=[
        ThreadingInstrumentor,
        RequestsInstrumentor,
    ],
)
```

By default, the library enables only manual instrumentation (trace_operation(), add_span_event(), set_span_attribute()). You must explicitly pass the instrumentor classes you need. This follows the principle: explicit is better than implicit.
If instrumentors is None or [], only manual tracing is available. No automatic instrumentation occurs.
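For reference, the instrumentor wiring inside setup_telemetry() can be as simple as the following hypothetical helper; it assumes each class follows OpenTelemetry's standard BaseInstrumentor interface, which ThreadingInstrumentor and RequestsInstrumentor both do:

```python
# Hypothetical helper; the actual core.telemetry implementation may differ
def _apply_instrumentors(instrumentors):
    for instrumentor_cls in instrumentors or []:
        instrumentor = instrumentor_cls()
        # BaseInstrumentor exposes this flag to guard against double instrumentation
        if not instrumentor.is_instrumented_by_opentelemetry:
            instrumentor.instrument()
```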
For business logic tracing, use the context manager:
```python
from core.telemetry import trace_operation, add_span_event, set_span_attribute

with trace_operation("process_payment", {"user_id": user.id}):
    add_span_event("validation_started")

    # Your business logic here
    validate_payment(payment_data)

    set_span_attribute("payment_amount", payment_data.amount)
    add_span_event("payment_processed")
```

Error handling is automatic. If an exception occurs inside the context manager, the span is automatically marked with error status and exported (100% capture rate).
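For context, a context manager along these lines produces exactly that behavior. This is a plausible sketch, not the actual core.telemetry implementation:

```python
from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

@contextmanager
def trace_operation(name: str, attributes: dict | None = None):
    with tracer.start_as_current_span(
        name,
        attributes=attributes,
        record_exception=False,       # handled explicitly below
        set_status_on_exception=False,
    ) as span:
        try:
            yield span
        except Exception as exc:
            # Mark the span as failed so ErrorAwareBatchSpanProcessor always exports it
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```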
Here's a complete working example:
```python
from flask import Flask, jsonify, request
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.threading import ThreadingInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

from core.telemetry import setup_telemetry, trace_operation

app = Flask(__name__)

setup_telemetry(
    environment="production",
    service_name="payment-api",
    xray_enabled=True,
    otlp_endpoint="http://localhost:4318",
    sampling_rate=0.05,
    instrumentors=[
        ThreadingInstrumentor,  # Context propagation for threads
        RequestsInstrumentor,   # Auto-instrument outgoing HTTP calls
    ],
)

# FlaskInstrumentor requires separate app instrumentation
FlaskInstrumentor().instrument_app(app)

@app.post("/api/process")
def process_data():
    data = request.get_json() or {}
    with trace_operation("process_data", {"user_id": data.get("user_id")}):
        # Business logic
        result = perform_processing(data)
        return jsonify(result), 200

if __name__ == "__main__":
    app.run()
```

Note: FlaskInstrumentor must call .instrument_app(app) separately after setup. It cannot be passed in the instrumentors list like the other instrumentors.
The collector acts as a sidecar, handling AWS authentication and buffering:
docker-compose.yml

```yaml
services:
  otel:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]  # point the collector at the mounted config
    ports:
      - "4318:4318"   # OTLP HTTP
      - "4317:4317"   # OTLP gRPC
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    environment:
      - AWS_REGION=${AWS_REGION}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
```

otel-collector-config.yaml
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  awsxray:
    region: eu-central-1

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
```

Validation: The setup_telemetry() function uses Pydantic's @validate_call decorator to validate parameters at runtime. The sampling_rate parameter must be between 0.0 and 1.0, or a ValidationError will be raised.
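As an illustration, that constraint can be expressed with Pydantic v2 roughly like this (the defaults and the real signature in core.telemetry may differ):

```python
from typing import Annotated, Optional

from pydantic import Field, validate_call

@validate_call
def setup_telemetry(
    environment: str,
    service_name: str,
    xray_enabled: bool = False,           # defaults here are illustrative
    otlp_endpoint: str = "http://localhost:4318",
    sampling_rate: Annotated[float, Field(ge=0.0, le=1.0)] = 0.05,  # enforced at call time
    instrumentors: Optional[list] = None,
) -> None:
    ...

setup_telemetry(environment="production", service_name="my-api", sampling_rate=1.5)  # raises ValidationError
```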
If you're using background threads (Celery...), ThreadingInstrumentor is not optional. Without it, spans created in threads won't be linked to parent traces:

```python
instrumentors=[
    ThreadingInstrumentor,  # Must be first
    ...
]
```

The ErrorAwareBatchSpanProcessor ensures all errors are captured. In production, this means:
- Every failed request traced (100%)
- Every exception traced (100%)
- Normal operations sampled at configured rate (e.g., 5%)
This reduces costs while maintaining debugging capability.
Using trace_id % 100 for sampling ensures consistency across distributed traces. If a trace is sampled in Service A, related spans in Service B will also be sampled (assuming proper context propagation).
In a production API handling ~1M requests/day:
- Without adaptive sampling: ~1M spans/day → ~$150/month
- With 5% sampling + 100% errors: ~50K normal spans + errors → ~$15/month
- Error visibility: 100% (no errors missed)
The cost reduction is significant, and you still capture every error trace.
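A quick back-of-the-envelope check of the volume reduction behind those numbers (the error rate is a made-up illustrative figure):

```python
requests_per_day = 1_000_000
sampling_rate = 0.05
error_rate = 0.001  # hypothetical; errors are always exported regardless of this rate

normal_spans = int(requests_per_day * sampling_rate)  # 50,000/day
error_spans = int(requests_per_day * error_rate)      # 1,000/day, all of them kept
total = normal_spans + error_spans
print(f"{total:,} spans/day exported ({1 - total / requests_per_day:.0%} fewer than exporting everything)")
```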
The included Flask app has several endpoints to demonstrate different scenarios:
```bash
# Health check
curl http://localhost:5000/health

# Normal operation (5% sampling)
curl -X POST http://localhost:5000/api/process \
  -H "Content-Type: application/json" \
  -d '{"user_id": "123"}'

# Random failures (demonstrates 100% error capture)
curl http://localhost:5000/api/random

# External API call (auto-instrumented)
curl http://localhost:5000/api/external

# Nested operations
curl http://localhost:5000/api/nested
```

Check the AWS X-Ray console to see the traces. Notice that:
- All errors appear in X-Ray (100% capture)
- Normal operations appear ~5% of the time
- Nested spans maintain parent-child relationships
This approach works well when:
- You need production observability without exploding costs
- Errors are more important than sampling every successful request
- You're using AWS infrastructure (X-Ray integrates well with other AWS services)
- You want vendor-neutral instrumentation (OpenTelemetry can export to multiple backends)
It's probably overkill for:
- Development environments (just sample everything at 100%)
- Low-traffic services (sampling won't save much money)
- Non-distributed applications (simpler logging might suffice)
And that's all. Adaptive sampling with OpenTelemetry gives you the best of both worlds: comprehensive error visibility and manageable costs.
The complete implementation is available at: github.com/gonzalo123/telemetry