Is your feature request related to a problem?
Currently, Acontext lacks observability capabilities for distributed tracing across its microservices architecture. When debugging issues or monitoring performance, it's difficult to trace requests as they flow through the Go API server and Python Core service. There's no standardized way to correlate logs and metrics across services, making it challenging to identify bottlenecks, debug errors, and understand the complete request lifecycle.
Describe the solution you'd like
Integrate OpenTelemetry (OTEL) for distributed tracing across Acontext services. This implementation should:
- Standardized Tracing: Use OpenTelemetry's standard OTLP protocol to export traces, allowing users to choose their preferred observability backend (Jaeger, Prometheus, Datadog, Grafana Cloud, etc.) without vendor lock-in.
- Cross-Service Tracing: Automatically propagate trace context between the Go API server and the Python Core service, enabling end-to-end request tracing.
- Automatic Instrumentation:
  - Instrument the Gin HTTP framework in the Go API server
  - Instrument FastAPI in the Python Core service
  - Support database (GORM) and Redis tracing
- Configuration-Based: Integrate with the existing configuration system, allowing users to enable/disable tracing and configure sampling rates.
- Trace ID in Responses: Include the trace ID in HTTP response headers (X-Trace-Id) for easy correlation with logs and external monitoring tools; see the middleware sketch after this list.
- Configurable Sampling: Support configurable sampling ratios (0.0-1.0) to balance observability with performance in production environments.
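For illustration, the X-Trace-Id header could be produced by a small Gin middleware along the lines of the sketch below. This is an assumption about how a telemetry.TraceIDMiddleware() helper might look, not the actual Acontext implementation; it relies only on Gin and the standard OpenTelemetry Go API.
// Hypothetical sketch of TraceIDMiddleware. It reads the active span context
// (placed on the request context by the Gin OTel middleware) and echoes the
// trace ID back to the caller in the X-Trace-Id response header.
package telemetry

import (
    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/otel/trace"
)

func TraceIDMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        sc := trace.SpanContextFromContext(c.Request.Context())
        if sc.HasTraceID() {
            c.Header("X-Trace-Id", sc.TraceID().String())
        }
        c.Next()
    }
}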
Describe alternatives you've considered
- Vendor-specific solutions: Considered integrating directly with Jaeger or Datadog, but this would create vendor lock-in and limit flexibility.
- Custom tracing solution: Building a custom tracing system would require significant development effort and maintenance overhead.
- Log-based correlation: Using correlation IDs in logs is less powerful than distributed tracing and doesn't provide the same level of observability.
Use Case
- Debugging Production Issues: When a user reports an error, developers can use the trace ID from the response header to quickly locate the exact request flow in Jaeger UI, identifying which service and operation caused the issue.
- Performance Monitoring: Track request latency across services to identify bottlenecks. For example, understanding if slow responses are due to database queries, external API calls, or processing time.
- Service Dependencies: Visualize how requests flow through the system, understanding dependencies between Go API and Python Core services.
- Compliance and Auditing: Maintain trace records for compliance requirements and audit trails.
Proposed API/Interface
Configuration (YAML)
telemetry:
otlpEndpoint: "${OTEL_EXPORTER_OTLP_ENDPOINT}"
enabled: true
sampleRatio: 1.0 # Sampling ratio, 0.0-1.0, default 1.0 (100%)
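For illustration only, the telemetry block above could map onto a Go configuration struct along these lines; the field names and their placement in the codebase are assumptions, not Acontext's actual config schema.
// Hypothetical struct for the telemetry YAML block shown above.
type TelemetryConfig struct {
    Enabled      bool    `yaml:"enabled"`      // enable/disable tracing
    OTLPEndpoint string  `yaml:"otlpEndpoint"` // resolved from ${OTEL_EXPORTER_OTLP_ENDPOINT}, e.g. localhost:4317
    SampleRatio  float64 `yaml:"sampleRatio"`  // 0.0-1.0, default 1.0 (100%)
}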
Go API Server
// Automatic instrumentation via middleware
r.Use(telemetry.GinMiddleware(serviceName))
r.Use(telemetry.TraceIDMiddleware())
// Manual span creation for business logic
tracer := otel.Tracer("acontext-api")
ctx, span := tracer.Start(ctx, "operation.name")
defer span.End()
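A rough sketch of how the tracer provider behind the snippet above could be initialized: an OTLP/gRPC exporter, a batch span processor, a parent-based ratio sampler, and W3C trace-context propagation. The setupTracing function name and its parameters are assumptions for illustration; only the OpenTelemetry Go SDK calls are standard.
// Hypothetical tracer provider setup for the Go API server.
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func setupTracing(ctx context.Context, serviceName, otlpEndpoint string, sampleRatio float64) (*sdktrace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to the configured collector endpoint.
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(otlpEndpoint),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter), // batched, asynchronous export
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(sampleRatio))),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )

    otel.SetTracerProvider(tp)
    // Propagate W3C trace context and baggage (traceparent, tracestate, baggage headers).
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
    return tp, nil
}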
Python Core Service
# Automatic FastAPI instrumentation
from acontext_core.telemetry.otel import setup_otel_tracing, instrument_fastapi
setup_otel_tracing(
service_name="acontext-core",
otlp_endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
)
instrument_fastapi(app)
# Manual span creation
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("operation.name") as span:
    span.set_attribute("key", "value")
HTTP Response Header
All API responses include the trace ID:
X-Trace-Id: 4bf92f3577b34da6a3ce929d0e0e4736
Component/Area
Which part of Acontext would this feature affect?
- Client SDK (Python)
- Client SDK (TypeScript)
- Core Service (Python)
- API Server (Go)
- UI/Dashboard (Next.js)
- CLI Tool
- Documentation
- Other (Docker Compose configuration for observability backends)
Additional Context
Architecture
┌─────────────────┐ ┌─────────────────┐
│ Go API (Gin) │────────▶│ OTEL Exporter │
│ Port 8029 │ │ (OTLP/gRPC) │
└─────────────────┘ └────────┬────────┘
│
┌─────────────────┐ │
│ Python Core │────────────────┘
│ (FastAPI) │
│ Port 8000 │
└─────────────────┘
│
┌─────────┴──────────┐
│ User's Choice: │
│ - Jaeger │
│ - Prometheus │
│ - Datadog │
│ - Grafana Cloud │
└────────────────────┘
Key Features
- Trace Context Propagation: Automatic propagation via HTTP headers (traceparent, tracestate) between services; see the sketch after this list
- Zero Vendor Lock-in: Uses the standard OpenTelemetry protocol, compatible with any OTLP-compatible backend
- Graceful Degradation: If tracing is disabled or misconfigured, services continue to operate normally
- Production Ready: Configurable sampling rates, batch span processing, and async export to minimize performance impact
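As a concrete illustration of the trace-context propagation item above: if the Go API server calls the Python Core service over HTTP, wrapping the client transport with otelhttp is one standard way to inject the traceparent/tracestate headers automatically, so FastAPI's instrumentation can continue the same trace. This is a sketch under that assumption, not necessarily how Acontext's internal client is wired.
// Hypothetical outbound call from the Go API server to the Python Core service.
import (
    "context"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func callCore(ctx context.Context, url string) (*http.Response, error) {
    // otelhttp.NewTransport injects traceparent/tracestate into outgoing requests.
    client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
    // The request must carry the active span's context for propagation to occur.
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return client.Do(req)
}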
Implementation Status
This feature has been implemented and includes:
- Go API server OpenTelemetry integration with Gin middleware
- Python Core service OpenTelemetry integration with FastAPI instrumentation
- Configuration system integration
- Trace context propagation between services
- Trace ID middleware for response headers
- Support for GORM and Redis instrumentation
- Docker Compose configuration for Jaeger (example backend); an illustrative snippet follows this list
- Comprehensive documentation
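For reference, a minimal Docker Compose service for the Jaeger example backend could look roughly like the snippet below; the image tag and port mappings are illustrative, not the exact file shipped with Acontext.
# Illustrative Jaeger all-in-one service: 4317 receives OTLP/gRPC, 16686 serves the UI.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver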
Benefits
- Observability: Complete visibility into request flows across microservices
- Debugging: Faster issue resolution with trace correlation
- Performance: Identify and optimize bottlenecks
- Flexibility: Choose any OpenTelemetry-compatible backend
- Standards-Based: Uses industry-standard OpenTelemetry protocol