
Conversation

@ByteBaker ByteBaker (Contributor) commented Oct 27, 2025

Automatically discovers service topology from distributed traces and exposes Prometheus metrics for monitoring service-to-service communication patterns.

Architecture

Enterprise Feature: Core logic resides in the enterprise codebase, with API routing handled in the OSS repository. The feature is gated with #[cfg(feature = "enterprise")] and controlled by the ZO_SERVICE_GRAPH_ENABLED environment variable.

A 256-shard concurrent edge store with LRU eviction tracks incomplete span pairs (CLIENT/SERVER) within a configurable time window. When matching spans are found, complete edges are emitted as Prometheus metrics.
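For illustration, here is a minimal sketch of what such a sharded store can look like, assuming the `ahash`, `lru`, and `parking_lot` crates; all names are illustrative, not the actual enterprise implementation:

```rust
use std::{hash::BuildHasher, num::NonZeroUsize};

use ahash::RandomState;
use lru::LruCache;
use parking_lot::RwLock;

const SHARD_COUNT: usize = 256;

/// Key for a pending edge: trace_id plus the client span's span_id
/// (which equals the server span's parent_span_id).
type EdgeKey = String;

/// Placeholder payload for an incomplete CLIENT/SERVER pair.
struct PendingEdge {
    org_id: String,
    client_service: Option<String>,
    server_service: Option<String>,
    inserted_at_ms: i64,
}

struct ShardedEdgeStore {
    hasher: RandomState,
    shards: Vec<RwLock<LruCache<EdgeKey, PendingEdge>>>,
}

impl ShardedEdgeStore {
    fn new(max_items_per_shard: usize) -> Self {
        let cap = NonZeroUsize::new(max_items_per_shard).expect("non-zero capacity");
        Self {
            hasher: RandomState::new(),
            shards: (0..SHARD_COUNT)
                .map(|_| RwLock::new(LruCache::new(cap)))
                .collect(),
        }
    }

    /// Shard selection is a hash computation, so no global lock is ever taken.
    fn shard_for(&self, key: &EdgeKey) -> usize {
        (self.hasher.hash_one(key) as usize) % SHARD_COUNT
    }

    /// Store one half of a pair; the LRU evicts the oldest entry when the shard is full.
    fn upsert(&self, key: EdgeKey, edge: PendingEdge) {
        self.shards[self.shard_for(&key)].write().put(key, edge);
    }

    /// Remove and return the counterpart if the other half already arrived.
    fn take(&self, key: &EdgeKey) -> Option<PendingEdge> {
        self.shards[self.shard_for(key)].write().pop(key)
    }
}
```

Because only one of the 256 per-shard locks is held at a time, concurrent ingestion threads rarely contend with each other.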

Key design decisions:

  • 256-shard architecture minimizes lock contention
  • parking_lot::RwLock for synchronous locking, roughly 5-50x faster than tokio's async RwLock for short critical sections
  • Per-org isolation enforced at all layers (processing, storage, metrics, API)
  • Panic boundaries prevent service graph failures from affecting trace ingestion
  • Enterprise-only feature with OSS providing routing endpoints

Core Components

Data Models (enterprise: service_graph/edge.rs)

  • SpanForGraph: Lightweight span representation
  • ServiceGraphEdge: Client-server connections with bidirectional timing, connection type, org_id
  • Bidirectional matching algorithm (CLIENT→SERVER via trace_id + span_id/parent_span_id); see the data-model sketch below
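A hedged sketch of what these models and the span-kind enum might look like; the field names are assumptions rather than the actual enterprise definitions:

```rust
/// Span kinds relevant to edge building (a subset of OTLP span kinds).
enum SpanKind {
    Client,
    Server,
    Producer,
    Consumer,
    Internal,
}

/// Lightweight projection of a span, keeping only what graph building needs.
struct SpanForGraph {
    trace_id: String,
    span_id: String,
    parent_span_id: String,
    service_name: String,
    span_kind: SpanKind,
    duration_ms: f64,
    is_error: bool,
}

/// A completed client→server connection, ready to be emitted as metrics.
struct ServiceGraphEdge {
    org_id: String,
    client: String,
    server: String,
    connection_type: String, // standard / messaging / database / virtual
    client_duration_ms: f64, // client-side latency
    server_duration_ms: f64, // server-side latency
    failed: bool,
}
```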

Storage (enterprise: service_graph/store.rs)

  • 256-shard hashmap with per-shard LRU cache
  • Lock-free shard selection via AHash; write locks taken only for LRU access-time updates
  • Per-org tracking: size, capacity utilization, eviction rate
  • Reads configuration from OSS config module

Processing (enterprise: service_graph/processor.rs)

  • Bidirectional span matching with expiration-based cleanup
  • Connection type inference (standard/messaging/database/virtual)
  • Priority-ordered peer service discovery from span attributes (see the sketch below)
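A rough sketch of how connection-type inference and priority-ordered peer discovery can work over span attributes; the attribute names and priority order here are assumptions, not the exact enterprise logic:

```rust
use std::collections::HashMap;

enum ConnectionType {
    Standard,
    Messaging,
    Database,
    Virtual, // e.g. when only one side of the pair is ever observed
}

/// Infer the connection type from well-known span attributes.
fn infer_connection_type(attrs: &HashMap<String, String>) -> ConnectionType {
    if attrs.contains_key("messaging.system") {
        ConnectionType::Messaging
    } else if attrs.contains_key("db.system") {
        ConnectionType::Database
    } else {
        ConnectionType::Standard
    }
}

/// Resolve the peer service name by probing attributes in priority order.
fn discover_peer_service(attrs: &HashMap<String, String>) -> Option<String> {
    const PRIORITY: &[&str] = &[
        "peer.service",
        "server.address",
        "net.peer.name",
        "db.name",
        "messaging.destination.name",
    ];
    PRIORITY.iter().find_map(|key| attrs.get(*key).cloned())
}
```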

Background Workers (enterprise: service_graph/worker.rs)

  • 256 cleanup workers (one per shard)
  • Cleanup interval configurable via ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS (see the worker sketch below)
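For illustration only, per-shard cleanup workers could be spawned with tokio roughly like this; the store type and its `cleanup_shard` method are hypothetical stand-ins, not the enterprise code:

```rust
use std::{sync::Arc, time::Duration};

const SHARD_COUNT: usize = 256;

/// Stand-in for the sharded store; only what this example needs.
struct ShardedEdgeStore;

impl ShardedEdgeStore {
    /// Hypothetical: remove and return org_ids of edges in `shard_id`
    /// that are older than the configured wait duration.
    fn cleanup_shard(&self, _shard_id: usize) -> Vec<String> {
        Vec::new()
    }
}

/// Spawn one cleanup task per shard; each periodically expires unmatched edges.
fn spawn_cleanup_workers(store: Arc<ShardedEdgeStore>, cleanup_interval_ms: u64) {
    for shard_id in 0..SHARD_COUNT {
        let store = store.clone();
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(Duration::from_millis(cleanup_interval_ms));
            loop {
                ticker.tick().await;
                for org_id in store.cleanup_shard(shard_id) {
                    // e.g. increment service_graph_edges_expired{org_id} here
                    let _ = org_id;
                }
            }
        });
    }
}
```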

REST API (OSS: service_graph/api.rs - routing only)

  • GET /api/{org_id}/traces/service_graph/metrics - Prometheus metrics filtered by org_id
  • GET /api/{org_id}/traces/service_graph/stats - Per-org statistics
  • Routes are registered only when the enterprise feature is enabled (see the routing sketch below)
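A minimal sketch of how the OSS side might conditionally register these routes with actix-web; the handler name and response body are placeholders, not the actual api.rs code:

```rust
use actix_web::{web, HttpResponse, Responder};

#[cfg(feature = "enterprise")]
async fn get_service_graph_metrics(path: web::Path<String>) -> impl Responder {
    let org_id = path.into_inner();
    // In the real feature this would delegate to the enterprise crate,
    // which returns Prometheus metrics filtered by org_id.
    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(format!("# service graph metrics for {org_id}\n"))
}

/// Register service graph routes only when built with the enterprise feature.
pub fn register_routes(cfg: &mut web::ServiceConfig) {
    #[cfg(feature = "enterprise")]
    cfg.service(
        web::resource("/api/{org_id}/traces/service_graph/metrics")
            .route(web::get().to(get_service_graph_metrics)),
    );
    // Without the feature flag the routes simply do not exist.
    #[cfg(not(feature = "enterprise"))]
    let _ = cfg;
}
```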

Frontend (OSS: web/src/plugins/traces/ServiceGraph.vue - gated in enterprise builds)

  • Real-time topology visualization with vis.js
  • Multiple layouts (hierarchical/force-directed/circular)
  • Auto-refresh, filtering, error handling

Metrics (all include org_id label)

Request metrics:

  • traces_service_graph_request_total{org_id,client,server,connection_type}
  • traces_service_graph_request_failed_total{org_id,client,server,connection_type}
  • traces_service_graph_request_server_seconds{org_id,client,server,connection_type}
  • traces_service_graph_request_client_seconds{org_id,client,server,connection_type}

Operational metrics:

  • service_graph_edges_expired{org_id} - Unpaired edges
  • service_graph_edges_evicted{org_id} - LRU evictions
  • service_graph_store_size{org_id} - In-memory edge count
  • service_graph_dropped_spans{org_id} - Panics during processing
  • service_graph_eviction_rate{org_id} - Evictions/sec (backpressure indicator)
  • service_graph_capacity_utilization{org_id} - Usage percentage
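As an illustration of the metric families listed above, labelled metrics might be defined with the prometheus crate roughly like this; the metric names mirror the lists, but the registration code itself is an assumption:

```rust
use once_cell::sync::Lazy;
use prometheus::{HistogramOpts, HistogramVec, IntCounterVec, IntGaugeVec, Opts};

static REQUEST_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    IntCounterVec::new(
        Opts::new(
            "traces_service_graph_request_total",
            "Total requests observed between client and server services",
        ),
        &["org_id", "client", "server", "connection_type"],
    )
    .expect("valid metric definition")
});

static REQUEST_SERVER_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    HistogramVec::new(
        HistogramOpts::new(
            "traces_service_graph_request_server_seconds",
            "Server-side latency of matched client/server span pairs",
        ),
        &["org_id", "client", "server", "connection_type"],
    )
    .expect("valid metric definition")
});

static STORE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    IntGaugeVec::new(
        Opts::new("service_graph_store_size", "In-memory pending edge count"),
        &["org_id"],
    )
    .expect("valid metric definition")
});

/// Would be called once at startup so the metrics appear in scrapes.
fn register_with_default_registry() {
    let registry = prometheus::default_registry();
    registry.register(Box::new(REQUEST_TOTAL.clone())).ok();
    registry.register(Box::new(REQUEST_SERVER_SECONDS.clone())).ok();
    registry.register(Box::new(STORE_SIZE.clone())).ok();
}

/// Emit the counters for one completed edge.
fn record_edge(org_id: &str, client: &str, server: &str, conn: &str, server_secs: f64) {
    REQUEST_TOTAL
        .with_label_values(&[org_id, client, server, conn])
        .inc();
    REQUEST_SERVER_SECONDS
        .with_label_values(&[org_id, client, server, conn])
        .observe(server_secs);
}
```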

Configuration

Environment variables (all prefixed with ZO_SERVICE_GRAPH_):

  • ZO_SERVICE_GRAPH_ENABLED (default: false) - Enable/disable service graph feature
  • ZO_SERVICE_GRAPH_WAIT_DURATION_MS (default: 10000) - Max time to wait for span pair matching
  • ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD (default: 100000) - LRU capacity per shard (256 shards = 25.6M total capacity)
  • ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS (default: 2000) - Cleanup worker frequency per shard

Note: Worker count is fixed at 256 (one per shard). Configuration is read from the OSS config module by the enterprise code.
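A small sketch of how these variables could be read with the standard library only; the defaults are copied from the list above, while the struct and function names are illustrative:

```rust
use std::env;

#[derive(Debug)]
struct ServiceGraphConfig {
    enabled: bool,
    wait_duration_ms: u64,
    max_items_per_shard: usize,
    cleanup_interval_ms: u64,
}

/// Parse an environment variable, falling back to a default on absence or parse failure.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn load_service_graph_config() -> ServiceGraphConfig {
    ServiceGraphConfig {
        enabled: env_or("ZO_SERVICE_GRAPH_ENABLED", false),
        wait_duration_ms: env_or("ZO_SERVICE_GRAPH_WAIT_DURATION_MS", 10_000),
        max_items_per_shard: env_or("ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD", 100_000),
        cleanup_interval_ms: env_or("ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS", 2_000),
    }
}
```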

Performance

  • Throughput: O(1) writes with AHash sharding
  • Latency: Sub-millisecond span processing
  • Memory: ~1KB per incomplete edge, automatic LRU eviction
  • Cleanup: O(expired) per shard
  • Tests: 26 tests covering edge creation, matching, isolation, cleanup (all in enterprise codebase)

@github-actions

Failed to generate code suggestions for PR

@greptile-apps greptile-apps bot (Contributor) left a comment


Greptile Overview

Greptile Summary

This PR introduces a comprehensive service graph feature that automatically discovers service topology from distributed traces. The implementation uses a well-architected 256-shard concurrent edge store with LRU eviction, proper org-level isolation, and panic boundaries to protect trace ingestion.

Key architectural strengths:

  • 256-shard design with parking_lot RwLocks minimizes lock contention
  • Per-org isolation enforced at all layers (processing, storage, metrics, API)
  • Panic boundaries prevent service graph failures from affecting trace ingestion
  • Parallel cleanup workers (one per shard) with separate metrics worker
  • Comprehensive Prometheus metrics with org_id labels

Critical bug found:

  • Lines 144-148 in processor.rs: Both client_key and server_key are identical, preventing any client-server span matching. Server spans store themselves with parent_span_id as key (line 205), but client lookup uses wrong key.

Additional observations:

  • Configuration properly namespaced with the ZO_SERVICE_GRAPH_ prefix
  • Good test coverage across core modules
  • API endpoints properly filter by org_id for multi-tenant isolation
  • Bidirectional timing metrics (client-side and server-side latencies)

Confidence Score: 1/5

  • Critical bug prevents core functionality from working - span matching will fail for all client-server pairs
  • The service graph feature has a critical logic error in processor.rs:144-148 where client_key and server_key are identical, preventing any span pairing. This breaks the fundamental matching algorithm since server spans are stored with parent_span_id as key but lookups use span_id. Without this fix, no edges will ever complete successfully.
  • src/service/traces/service_graph/processor.rs requires immediate attention - the span matching logic is broken

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/service/traces/service_graph/processor.rs | 1/5 | Critical bug in client span matching logic - both client_key and server_key use same span_id, preventing proper span pairing |
| src/service/traces/service_graph/store.rs | 4/5 | Well-designed sharded LRU store with proper concurrency handling and org-level tracking |
| src/service/traces/service_graph/edge.rs | 5/5 | Clean data models with bidirectional timing support and comprehensive test coverage |
| src/service/traces/service_graph/worker.rs | 4/5 | Efficient parallel cleanup with per-shard workers and proper shutdown handling |
| src/service/traces/service_graph/api.rs | 5/5 | Properly enforces org-level isolation in metrics and stats endpoints |

Sequence Diagram

sequenceDiagram
    participant Client as Client Span
    participant Ingestion as Trace Ingestion
    participant Processor as Service Graph Processor
    participant Store as Sharded Edge Store
    participant Worker as Cleanup Workers
    participant Metrics as Prometheus Metrics
    participant API as REST API

    Client->>Ingestion: OTLP Trace Data
    Ingestion->>Ingestion: Parse and Extract Spans
    
    alt Service Graph Enabled
        Ingestion->>Processor: process_span with org context
        Note over Processor: Wrapped in catch_unwind
        
        alt Client or Producer Span
            Processor->>Store: Check for matching server edge
            alt Server Found
                Store-->>Processor: Return server edge
                Processor->>Processor: Complete edge with client data
                Processor->>Metrics: Emit edge metrics with org label
            else No Server
                Processor->>Store: Insert client edge with lookup key
                Note over Store: Uses trace and span identifiers
            end
        else Server or Consumer Span
            Processor->>Store: Check for matching client edge
            alt Client Found
                Store-->>Processor: Return client edge
                Processor->>Processor: Complete edge with server data
                Processor->>Metrics: Emit edge metrics with org label
            else No Client
                Processor->>Store: Insert server edge with parent key
                Note over Store: Uses trace and parent identifiers
            end
        end
    end

    loop Every 2 seconds per shard
        Worker->>Store: cleanup_shard for specific shard
        Store-->>Worker: Return expired edges
        Worker->>Metrics: Emit expired edge metrics per org
        Worker->>Metrics: Increment expiration counter per org
    end

    loop Every 5 seconds
        Worker->>Store: get_org_sizes from all shards
        Worker->>Metrics: Update store size gauge per org
        Worker->>Metrics: Update capacity utilization per org
        Worker->>Metrics: Calculate eviction rate per org
    end

    API->>Metrics: GET metrics endpoint with org filter
    Metrics-->>API: Filtered Prometheus metrics for org
    API-->>Client: Prometheus text format

    API->>Store: GET stats endpoint with org filter
    Store-->>API: Organization specific statistics
    API-->>Client: JSON statistics response

21 files reviewed, 1 comment


@ByteBaker
Contributor Author

Thanks for the review! I've addressed the comment about identical keys in processor.rs with commit ac1440e.

After analyzing the matching algorithm, I believe the identical keys are intentional:

Matching Logic:

  • Client span (span_id="C") stores itself at key trace-C
  • Server span (parent_span_id="C") stores itself at key trace-C (using parent_span_id)
  • When client looks for server: searches at trace-C
  • When server looks for client: searches at trace-C

The algorithm matches parent-child span relationships (client→server), not sibling relationships. Both operations use the client's span_id because:

  1. Server's parent_span_id equals Client's span_id
  2. This is the correct bidirectional matching for CLIENT/SERVER span pairs

I've consolidated the two variables into one lookup_key with clear documentation. All existing tests pass. If there's a specific scenario where matching fails, please let me know and I can add integration tests.
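To make that key scheme concrete, here is a hedged sketch; the function and enum are illustrative stand-ins, not the actual code in processor.rs:

```rust
/// Minimal stand-in for the span kinds involved in pairing.
enum SpanKind {
    Client,
    Producer,
    Server,
    Consumer,
}

// Both sides derive the same lookup key, so a CLIENT span and the SERVER span
// it spawned land on the same store entry:
//   CLIENT span (span_id = "C")        -> key "<trace_id>-C" (its own span_id)
//   SERVER span (parent_span_id = "C") -> key "<trace_id>-C" (its parent_span_id)
fn edge_lookup_key(
    trace_id: &str,
    span_kind: &SpanKind,
    span_id: &str,
    parent_span_id: &str,
) -> String {
    match span_kind {
        // Client/producer spans key by their own span_id.
        SpanKind::Client | SpanKind::Producer => format!("{trace_id}-{span_id}"),
        // Server/consumer spans key by the parent, i.e. the client's span_id.
        SpanKind::Server | SpanKind::Consumer => format!("{trace_id}-{parent_span_id}"),
    }
}
```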

@ByteBaker ByteBaker force-pushed the feat/service-graph branch 4 times, most recently from 16cc3ee to 2a56691 on October 28, 2025 at 07:58
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 2a56691

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout errors while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within timeout.

Root Cause Analysis

  • The timeout issue is likely related to changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to accommodate slower loading elements.
  2. Ensure that the element '[data-test="add-dashboard-name"]' is present and visible before the test runs.
  3. Review recent changes in the dashboard creation process that may affect element visibility.

View Detailed Results

@ByteBaker ByteBaker force-pushed the feat/service-graph branch 2 times, most recently from 92a44b1 to 456ca2b on October 28, 2025 at 10:11
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 456ca2b

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to be visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator timeout waiting for '[data-test="add-dashboard-name"]' to be visible.

Root Cause Analysis

  • The timeout issue in the test is likely related to recent changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js at line 28 to allow more time for the element to become visible.
  2. Verify that the element '[data-test="add-dashboard-name"]' is correctly rendered before the test attempts to interact with it.
  3. Add a wait condition or check for the element's presence before the visibility check to prevent timeouts.

View Detailed Results

@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 456ca2b

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Test fails due to timeout waiting for element visibility
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within 15 seconds.

Root Cause Analysis

  • The timeout issue in the test is likely related to changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to accommodate slower loading times.
  2. Ensure that the element '[data-test="add-dashboard-name"]' is present and visible before the test attempts to interact with it.
  3. Add explicit waits or checks for the element's visibility before proceeding with the test actions.

View Detailed Results

@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 0391fb4

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator timeout waiting for '[data-test="add-dashboard-name"]' to be visible.

Root Cause Analysis

  • The timeout error in the test is likely related to recent changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js for the element '[data-test="add-dashboard-name"]'.
  2. Ensure that the element is present and visible before the test attempts to interact with it.
  3. Review any recent UI changes that may affect the visibility of the dashboard name input.

View Detailed Results

Service graph automatically discovers service topology from distributed traces.

OSS includes:
- API routing endpoints (gated with enterprise feature flag)
- UI components (functional only with enterprise backend)
- Configuration options

Enterprise includes:
- 256-shard concurrent edge store with LRU eviction
- Span matching and processing logic
- Prometheus metrics and background workers
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: e8ee6c5

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 2s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within timeout.

Root Cause Analysis

  • The timeout error in dashboard-create.js indicates that the element was not rendered in time after recent code changes.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to allow more time for the element to become visible.
  2. Verify that the element '[data-test="add-dashboard-name"]' is correctly rendered in the DOM before the test runs.
  3. Check for any recent changes in the UI that may affect the visibility of the element.

View Detailed Results

- Ensure memtable isolation to prevent cross-test data contamination
- Fix dashboard API version compatibility (v5 -> version-agnostic dashboard_id)
- Fix indentation bug causing function/dashboard/stream deletions to run inside alert loop
- Add 3-second wait after dashboard deletion for propagation
…dpoints

- Add Utoipa path documentation for get_service_graph_metrics endpoint
- Add Utoipa path documentation for get_store_stats endpoint
- Add rate limiting via x-o2-ratelimit extension (module: Traces)
- Register both endpoints in OpenAPI configuration for Swagger UI
- Include detailed response examples and parameter descriptions
@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/service-graph | Commit: 66ef2b9

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |

View Detailed Results

@ByteBaker ByteBaker merged commit 4715205 into main Oct 30, 2025
32 of 33 checks passed
@ByteBaker ByteBaker deleted the feat/service-graph branch October 30, 2025 11:50