feat: add service graph feature for automatic service topology discovery #8889
Conversation
Failed to generate code suggestions for PR
Greptile Overview
Greptile Summary
This PR introduces a comprehensive service graph feature that automatically discovers service topology from distributed traces. The implementation uses a well-architected 256-shard concurrent edge store with LRU eviction, proper org-level isolation, and panic boundaries to protect trace ingestion.
Key architectural strengths:
- 256-shard design with parking_lot RwLocks minimizes lock contention
- Per-org isolation enforced at all layers (processing, storage, metrics, API)
- Panic boundaries prevent service graph failures from affecting trace ingestion
- Parallel cleanup workers (one per shard) with separate metrics worker
- Comprehensive Prometheus metrics with org_id labels
Critical bug found:
- Lines 144-148 in processor.rs: Both `client_key` and `server_key` are identical, preventing any client-server span matching. Server spans store themselves with `parent_span_id` as key (line 205), but the client lookup uses the wrong key.
Additional observations:
- Configuration properly namespaced with `ZO_SGRAPH_` prefix
- Good test coverage across core modules
- API endpoints properly filter by org_id for multi-tenant isolation
- Bidirectional timing metrics (client-side and server-side latencies)
Confidence Score: 1/5
- Critical bug prevents core functionality from working - span matching will fail for all client-server pairs
- The service graph feature has a critical logic error in processor.rs:144-148 where client_key and server_key are identical, preventing any span pairing. This breaks the fundamental matching algorithm since server spans are stored with parent_span_id as key but lookups use span_id. Without this fix, no edges will ever complete successfully.
- src/service/traces/service_graph/processor.rs requires immediate attention - the span matching logic is broken
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/service/traces/service_graph/processor.rs | 1/5 | Critical bug in client span matching logic - both client_key and server_key use same span_id, preventing proper span pairing |
| src/service/traces/service_graph/store.rs | 4/5 | Well-designed sharded LRU store with proper concurrency handling and org-level tracking |
| src/service/traces/service_graph/edge.rs | 5/5 | Clean data models with bidirectional timing support and comprehensive test coverage |
| src/service/traces/service_graph/worker.rs | 4/5 | Efficient parallel cleanup with per-shard workers and proper shutdown handling |
| src/service/traces/service_graph/api.rs | 5/5 | Properly enforces org-level isolation in metrics and stats endpoints |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client as Client Span
    participant Ingestion as Trace Ingestion
    participant Processor as Service Graph Processor
    participant Store as Sharded Edge Store
    participant Worker as Cleanup Workers
    participant Metrics as Prometheus Metrics
    participant API as REST API

    Client->>Ingestion: OTLP Trace Data
    Ingestion->>Ingestion: Parse and Extract Spans
    alt Service Graph Enabled
        Ingestion->>Processor: process_span with org context
        Note over Processor: Wrapped in catch_unwind
        alt Client or Producer Span
            Processor->>Store: Check for matching server edge
            alt Server Found
                Store-->>Processor: Return server edge
                Processor->>Processor: Complete edge with client data
                Processor->>Metrics: Emit edge metrics with org label
            else No Server
                Processor->>Store: Insert client edge with lookup key
                Note over Store: Uses trace and span identifiers
            end
        else Server or Consumer Span
            Processor->>Store: Check for matching client edge
            alt Client Found
                Store-->>Processor: Return client edge
                Processor->>Processor: Complete edge with server data
                Processor->>Metrics: Emit edge metrics with org label
            else No Client
                Processor->>Store: Insert server edge with parent key
                Note over Store: Uses trace and parent identifiers
            end
        end
    end

    loop Every 2 seconds per shard
        Worker->>Store: cleanup_shard for specific shard
        Store-->>Worker: Return expired edges
        Worker->>Metrics: Emit expired edge metrics per org
        Worker->>Metrics: Increment expiration counter per org
    end

    loop Every 5 seconds
        Worker->>Store: get_org_sizes from all shards
        Worker->>Metrics: Update store size gauge per org
        Worker->>Metrics: Update capacity utilization per org
        Worker->>Metrics: Calculate eviction rate per org
    end

    API->>Metrics: GET metrics endpoint with org filter
    Metrics-->>API: Filtered Prometheus metrics for org
    API-->>Client: Prometheus text format
    API->>Store: GET stats endpoint with org filter
    Store-->>API: Organization specific statistics
    API-->>Client: JSON statistics response
```
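For illustration, the catch_unwind boundary shown in the diagram might be shaped roughly like the sketch below; the span struct, function names, and logging are placeholders rather than the actual enterprise code.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical span shape; the real type lives in the enterprise crate.
pub struct SpanRef<'a> {
    pub trace_id: &'a str,
    pub span_id: &'a str,
    pub parent_span_id: Option<&'a str>,
}

fn process_span(_org_id: &str, _span: &SpanRef<'_>) {
    // Span matching and edge completion happen here in the real implementation.
}

/// Called from trace ingestion: a panic inside service graph processing is
/// caught here so it can never fail the ingestion request itself.
pub fn process_span_guarded(org_id: &str, span: &SpanRef<'_>) {
    if catch_unwind(AssertUnwindSafe(|| process_span(org_id, span))).is_err() {
        // The real code would increment service_graph_dropped_spans{org_id};
        // this sketch only logs the failure.
        eprintln!("service graph processing panicked; dropping span for org {org_id}");
    }
}
```

This is the behavior behind the `service_graph_dropped_spans{org_id}` metric listed in the PR description further down.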
21 files reviewed, 1 comment
Thanks for the review! I've addressed the comment about identical keys in processor.rs with commit ac1440e. After analyzing the matching algorithm, I believe the keys are intentionally identical by design.

Matching logic: the algorithm matches parent-child span relationships (client→server), not sibling relationships. Both operations use the client's span_id because server spans are stored under their parent_span_id, which is the client's span_id. I've consolidated the two variables into one.
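A minimal sketch of the key derivation described here, using illustrative names rather than the actual processor.rs code:

```rust
// Illustrative only: shows why the client's insert key and its lookup key end
// up being the same value under the parent-child matching scheme.
fn edge_key(trace_id: &str, span_id: &str) -> String {
    format!("{trace_id}:{span_id}")
}

struct Span<'a> {
    trace_id: &'a str,
    span_id: &'a str,
    parent_span_id: Option<&'a str>,
}

fn key_for(span: &Span<'_>, is_client: bool) -> Option<String> {
    if is_client {
        // CLIENT/PRODUCER spans key on their own span_id; the matching
        // SERVER span will carry that span_id as its parent_span_id.
        Some(edge_key(span.trace_id, span.span_id))
    } else {
        // SERVER/CONSUMER spans key on parent_span_id, i.e. the client's span_id.
        span.parent_span_id.map(|parent| edge_key(span.trace_id, parent))
    }
}
```

Since both sides derive the key from the client span's id, a single key variable on the client path is sufficient.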
Force-pushed from 16cc3ee to 2a56691
Test Run Failed - Testdino Test Results
Force-pushed from 92a44b1 to 456ca2b
Test Run Failed - Testdino Test Results
Test Run Failed - Testdino Test Results
Force-pushed from 456ca2b to 0391fb4
Test Run Failed - Testdino Test Results
Force-pushed from 0391fb4 to 317a5f1
Service graph automatically discovers service topology from distributed traces.

OSS includes:
- API routing endpoints (gated with enterprise feature flag)
- UI components (functional only with enterprise backend)
- Configuration options

Enterprise includes:
- 256-shard concurrent edge store with LRU eviction
- Span matching and processing logic
- Prometheus metrics and background workers
Force-pushed from 317a5f1 to e8ee6c5
Test Run Failed - Testdino Test Results
- Ensure memtable isolation to prevent cross-test data contamination
- Fix dashboard API version compatibility (v5 -> version-agnostic dashboard_id)
- Fix indentation bug causing function/dashboard/stream deletions to run inside alert loop
- Add 3-second wait after dashboard deletion for propagation
…dpoints
- Add Utoipa path documentation for get_service_graph_metrics endpoint
- Add Utoipa path documentation for get_store_stats endpoint
- Add rate limiting via x-o2-ratelimit extension (module: Traces)
- Register both endpoints in OpenAPI configuration for Swagger UI
- Include detailed response examples and parameter descriptions
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |
Automatically discovers service topology from distributed traces and exposes Prometheus metrics for monitoring service-to-service communication patterns.
Architecture
Enterprise Feature: Core logic resides in the enterprise codebase, with API routing handled in the OSS repository. The feature is gated with `#[cfg(feature = "enterprise")]` and controlled by the `ZO_SERVICE_GRAPH_ENABLED` environment variable.

A 256-shard concurrent edge store with LRU eviction tracks incomplete span pairs (CLIENT/SERVER) within a configurable time window. When matching spans are found, complete edges are emitted as Prometheus metrics.
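A rough sketch of the sharding idea follows; the type and field names are invented, LRU eviction is reduced to a capacity check, and the real store lives in the enterprise store.rs.

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

use parking_lot::RwLock;

const SHARD_COUNT: usize = 256;

struct PendingEdge {
    // org_id, client/server service names, timestamps, ...
}

struct EdgeStore {
    shards: Vec<RwLock<HashMap<String, PendingEdge>>>,
    max_items_per_shard: usize, // ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD
}

impl EdgeStore {
    fn new(max_items_per_shard: usize) -> Self {
        Self {
            shards: (0..SHARD_COUNT).map(|_| RwLock::new(HashMap::new())).collect(),
            max_items_per_shard,
        }
    }

    // Hash the edge key to pick one of the 256 shards, so writers for
    // different keys rarely contend on the same lock.
    fn shard_for(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % SHARD_COUNT
    }

    // Store an incomplete edge until its counterpart arrives.
    fn insert(&self, key: String, edge: PendingEdge) {
        let mut shard = self.shards[self.shard_for(&key)].write();
        if shard.len() >= self.max_items_per_shard {
            // Simplified stand-in: the real store evicts the least recently
            // used entry (tracked by service_graph_edges_evicted) instead.
            return;
        }
        shard.insert(key, edge);
    }

    // Remove and return the waiting half of an edge, if any.
    fn take(&self, key: &str) -> Option<PendingEdge> {
        self.shards[self.shard_for(key)].write().remove(key)
    }
}
```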
Key design decisions:
Core Components
- Data Models (enterprise: `service_graph/edge.rs`)
- Storage (enterprise: `service_graph/store.rs`)
- Processing (enterprise: `service_graph/processor.rs`)
- Background Workers (enterprise: `service_graph/worker.rs`), see the worker sketch after this list
  - Cleanup cadence controlled by `ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS`
- REST API (OSS: `service_graph/api.rs` - routing only)
  - `GET /api/{org_id}/traces/service_graph/metrics` - Prometheus metrics filtered by org_id
  - `GET /api/{org_id}/traces/service_graph/stats` - Per-org statistics
- Frontend (OSS: `web/src/plugins/traces/ServiceGraph.vue` - gated in enterprise builds)
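The per-shard cleanup worker referenced in the list above could be shaped roughly as follows, assuming a tokio runtime; the function and type names are placeholders, not the enterprise worker.rs API.

```rust
use std::sync::Arc;
use std::time::Duration;

struct ExpiredEdge; // placeholder for the org_id + edge data of a timed-out pair

struct EdgeStore;
impl EdgeStore {
    // Would drain edges older than ZO_SERVICE_GRAPH_WAIT_DURATION_MS from one shard.
    fn cleanup_shard(&self, _shard: usize) -> Vec<ExpiredEdge> {
        Vec::new()
    }
}

/// Spawns one cleanup task per shard, each ticking on its own interval
/// (ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS, default 2000 ms).
fn spawn_cleanup_workers(store: Arc<EdgeStore>, shard_count: usize, interval: Duration) {
    for shard in 0..shard_count {
        let store = store.clone();
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(interval);
            loop {
                ticker.tick().await;
                for _edge in store.cleanup_shard(shard) {
                    // The real worker emits the expired edge's metrics and bumps
                    // service_graph_edges_expired{org_id} here.
                }
            }
        });
    }
}
```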
Metrics (all include org_id label)

Request metrics:
- `traces_service_graph_request_total{org_id,client,server,connection_type}`
- `traces_service_graph_request_failed_total{org_id,client,server,connection_type}`
- `traces_service_graph_request_server_seconds{org_id,client,server,connection_type}`
- `traces_service_graph_request_client_seconds{org_id,client,server,connection_type}`

Operational metrics:
- `service_graph_edges_expired{org_id}` - Unpaired edges
- `service_graph_edges_evicted{org_id}` - LRU evictions
- `service_graph_store_size{org_id}` - In-memory edge count
- `service_graph_dropped_spans{org_id}` - Panics during processing
- `service_graph_eviction_rate{org_id}` - Evictions/sec (backpressure indicator)
- `service_graph_capacity_utilization{org_id}` - Usage percentage
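For illustration, registering and incrementing one of these series with the prometheus crate could look like the sketch below; only the metric name and label set come from the list above, while the help text, function names, and connection_type value are assumptions.

```rust
use prometheus::{IntCounterVec, Opts, Registry};

// Mirrors traces_service_graph_request_total{org_id,client,server,connection_type}.
fn register_request_total(registry: &Registry) -> prometheus::Result<IntCounterVec> {
    let counter = IntCounterVec::new(
        Opts::new(
            "traces_service_graph_request_total",
            "Completed service graph edges", // illustrative help text
        ),
        &["org_id", "client", "server", "connection_type"],
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// Increment the per-org, per-edge counter when a client/server pair completes.
fn record_edge(
    counter: &IntCounterVec,
    org_id: &str,
    client: &str,
    server: &str,
    connection_type: &str,
) {
    counter
        .with_label_values(&[org_id, client, server, connection_type])
        .inc();
}
```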
Configuration

Environment variables (all prefixed with `ZO_SERVICE_GRAPH_`):
- `ZO_SERVICE_GRAPH_ENABLED` (default: false) - Enable/disable service graph feature
- `ZO_SERVICE_GRAPH_WAIT_DURATION_MS` (default: 10000) - Max time to wait for span pair matching
- `ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD` (default: 100000) - LRU capacity per shard (256 shards = 25.6M total capacity)
- `ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS` (default: 2000) - Cleanup worker frequency per shard

Note: Worker count is fixed at 256 (one per shard). Configuration is read from the OSS config module by enterprise code.
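As the note says, the actual values flow through the OSS config module; purely to illustrate the names and defaults, an ad hoc environment read could look like this (the struct and helper are hypothetical):

```rust
use std::env;

// Hypothetical holder for the settings listed above; the real project keeps
// these in its config module rather than reading the environment directly.
struct ServiceGraphConfig {
    enabled: bool,
    wait_duration_ms: u64,
    max_items_per_shard: usize,
    cleanup_interval_ms: u64,
}

fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn load_config() -> ServiceGraphConfig {
    ServiceGraphConfig {
        enabled: env_or("ZO_SERVICE_GRAPH_ENABLED", false),
        wait_duration_ms: env_or("ZO_SERVICE_GRAPH_WAIT_DURATION_MS", 10_000),
        max_items_per_shard: env_or("ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD", 100_000),
        cleanup_interval_ms: env_or("ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS", 2_000),
    }
}
```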
Performance