
Conversation

@ByteBaker ByteBaker (Contributor) commented Oct 27, 2025

Automatically discovers service topology from distributed traces and exposes Prometheus metrics for monitoring service-to-service communication patterns.

Architecture

Enterprise Feature: Core logic resides in the enterprise codebase, with API routing handled in the OSS repository. The feature is gated with #[cfg(feature = "enterprise")] and controlled by the ZO_SERVICE_GRAPH_ENABLED environment variable.

A 256-shard concurrent edge store with LRU eviction tracks incomplete span pairs (CLIENT/SERVER) within a configurable time window. When matching spans are found, complete edges are emitted as Prometheus metrics.
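For illustration, here is a minimal sketch of what such a sharded store can look like, assuming the `ahash`, `lru`, and `parking_lot` crates; all names are illustrative, not the actual enterprise implementation:

```rust
use std::{hash::BuildHasher, num::NonZeroUsize};

use ahash::RandomState;
use lru::LruCache;
use parking_lot::RwLock;

const SHARD_COUNT: usize = 256;

/// Key for a pending edge: trace_id plus the client span's span_id
/// (which equals the server span's parent_span_id).
type EdgeKey = String;

/// Placeholder payload for an incomplete CLIENT/SERVER pair.
struct PendingEdge {
    org_id: String,
    client_service: Option<String>,
    server_service: Option<String>,
    inserted_at_ms: i64,
}

struct ShardedEdgeStore {
    hasher: RandomState,
    shards: Vec<RwLock<LruCache<EdgeKey, PendingEdge>>>,
}

impl ShardedEdgeStore {
    fn new(max_items_per_shard: usize) -> Self {
        let cap = NonZeroUsize::new(max_items_per_shard).expect("non-zero capacity");
        Self {
            hasher: RandomState::new(),
            shards: (0..SHARD_COUNT)
                .map(|_| RwLock::new(LruCache::new(cap)))
                .collect(),
        }
    }

    /// Shard selection is a hash computation, so no global lock is ever taken.
    fn shard_for(&self, key: &EdgeKey) -> usize {
        (self.hasher.hash_one(key) as usize) % SHARD_COUNT
    }

    /// Store one half of a pair; the LRU evicts the oldest entry when the shard is full.
    fn upsert(&self, key: EdgeKey, edge: PendingEdge) {
        self.shards[self.shard_for(&key)].write().put(key, edge);
    }

    /// Remove and return the counterpart if the other half already arrived.
    fn take(&self, key: &EdgeKey) -> Option<PendingEdge> {
        self.shards[self.shard_for(key)].write().pop(key)
    }
}
```

Because only one of the 256 per-shard locks is held at a time, concurrent ingestion threads rarely contend with each other.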

Key design decisions:

  • 256-shard architecture minimizes lock contention
  • parking_lot::RwLock for synchronous locking, roughly 5-50x faster than tokio's async RwLock for short critical sections
  • Per-org isolation enforced at all layers (processing, storage, metrics, API)
  • Panic boundaries prevent service graph failures from affecting trace ingestion
  • Enterprise-only feature with OSS providing routing endpoints

Core Components

Data Models (enterprise: service_graph/edge.rs)

  • SpanForGraph: Lightweight span representation
  • ServiceGraphEdge: Client-server connections with bidirectional timing, connection type, org_id
  • Bidirectional matching algorithm (CLIENT→SERVER via trace_id + span_id/parent_span_id); see the data-model sketch below
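A hedged sketch of what these models and the span-kind enum might look like; the field names are assumptions rather than the actual enterprise definitions:

```rust
/// Span kinds relevant to edge building (a subset of OTLP span kinds).
enum SpanKind {
    Client,
    Server,
    Producer,
    Consumer,
    Internal,
}

/// Lightweight projection of a span, keeping only what graph building needs.
struct SpanForGraph {
    trace_id: String,
    span_id: String,
    parent_span_id: String,
    service_name: String,
    span_kind: SpanKind,
    duration_ms: f64,
    is_error: bool,
}

/// A completed client→server connection, ready to be emitted as metrics.
struct ServiceGraphEdge {
    org_id: String,
    client: String,
    server: String,
    connection_type: String, // standard / messaging / database / virtual
    client_duration_ms: f64, // client-side latency
    server_duration_ms: f64, // server-side latency
    failed: bool,
}
```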

Storage (enterprise: service_graph/store.rs)

  • 256-shard hashmap with per-shard LRU cache
  • Lock-free shard selection via AHash; write locks taken only for LRU access-time updates
  • Per-org tracking: size, capacity utilization, eviction rate
  • Reads configuration from OSS config module

Processing (enterprise: service_graph/processor.rs)

  • Bidirectional span matching with expiration-based cleanup
  • Connection type inference (standard/messaging/database/virtual)
  • Priority-ordered peer service discovery from span attributes (see the sketch below)
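A rough sketch of how connection-type inference and priority-ordered peer discovery can work over span attributes; the attribute names and priority order here are assumptions, not the exact enterprise logic:

```rust
use std::collections::HashMap;

enum ConnectionType {
    Standard,
    Messaging,
    Database,
    Virtual, // e.g. when only one side of the pair is ever observed
}

/// Infer the connection type from well-known span attributes.
fn infer_connection_type(attrs: &HashMap<String, String>) -> ConnectionType {
    if attrs.contains_key("messaging.system") {
        ConnectionType::Messaging
    } else if attrs.contains_key("db.system") {
        ConnectionType::Database
    } else {
        ConnectionType::Standard
    }
}

/// Resolve the peer service name by probing attributes in priority order.
fn discover_peer_service(attrs: &HashMap<String, String>) -> Option<String> {
    const PRIORITY: &[&str] = &[
        "peer.service",
        "server.address",
        "net.peer.name",
        "db.name",
        "messaging.destination.name",
    ];
    PRIORITY.iter().find_map(|key| attrs.get(*key).cloned())
}
```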

Background Workers (enterprise: service_graph/worker.rs)

  • 256 cleanup workers (one per shard)
  • Cleanup interval configurable via ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS (see the worker sketch below)
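For illustration only, per-shard cleanup workers could be spawned with tokio roughly like this; the store type and its `cleanup_shard` method are hypothetical stand-ins, not the enterprise code:

```rust
use std::{sync::Arc, time::Duration};

const SHARD_COUNT: usize = 256;

/// Stand-in for the sharded store; only what this example needs.
struct ShardedEdgeStore;

impl ShardedEdgeStore {
    /// Hypothetical: remove and return org_ids of edges in `shard_id`
    /// that are older than the configured wait duration.
    fn cleanup_shard(&self, _shard_id: usize) -> Vec<String> {
        Vec::new()
    }
}

/// Spawn one cleanup task per shard; each periodically expires unmatched edges.
fn spawn_cleanup_workers(store: Arc<ShardedEdgeStore>, cleanup_interval_ms: u64) {
    for shard_id in 0..SHARD_COUNT {
        let store = store.clone();
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(Duration::from_millis(cleanup_interval_ms));
            loop {
                ticker.tick().await;
                for org_id in store.cleanup_shard(shard_id) {
                    // e.g. increment service_graph_edges_expired{org_id} here
                    let _ = org_id;
                }
            }
        });
    }
}
```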

REST API (OSS: service_graph/api.rs - routing only)

  • GET /api/{org_id}/traces/service_graph/metrics - Prometheus metrics filtered by org_id
  • GET /api/{org_id}/traces/service_graph/stats - Per-org statistics
  • Routes are registered only when the enterprise feature is enabled (see the routing sketch below)
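A minimal sketch of how the OSS side might conditionally register these routes with actix-web; the handler name and response body are placeholders, not the actual api.rs code:

```rust
use actix_web::{web, HttpResponse, Responder};

#[cfg(feature = "enterprise")]
async fn get_service_graph_metrics(path: web::Path<String>) -> impl Responder {
    let org_id = path.into_inner();
    // In the real feature this would delegate to the enterprise crate,
    // which returns Prometheus metrics filtered by org_id.
    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(format!("# service graph metrics for {org_id}\n"))
}

/// Register service graph routes only when built with the enterprise feature.
pub fn register_routes(cfg: &mut web::ServiceConfig) {
    #[cfg(feature = "enterprise")]
    cfg.service(
        web::resource("/api/{org_id}/traces/service_graph/metrics")
            .route(web::get().to(get_service_graph_metrics)),
    );
    // Without the feature flag the routes simply do not exist.
    #[cfg(not(feature = "enterprise"))]
    let _ = cfg;
}
```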

Frontend (OSS: web/src/plugins/traces/ServiceGraph.vue - gated in enterprise builds)

  • Real-time topology visualization with vis.js
  • Multiple layouts (hierarchical/force-directed/circular)
  • Auto-refresh, filtering, error handling

Metrics (all include org_id label)

Request metrics:

  • traces_service_graph_request_total{org_id,client,server,connection_type}
  • traces_service_graph_request_failed_total{org_id,client,server,connection_type}
  • traces_service_graph_request_server_seconds{org_id,client,server,connection_type}
  • traces_service_graph_request_client_seconds{org_id,client,server,connection_type}

Operational metrics:

  • service_graph_edges_expired{org_id} - Unpaired edges
  • service_graph_edges_evicted{org_id} - LRU evictions
  • service_graph_store_size{org_id} - In-memory edge count
  • service_graph_dropped_spans{org_id} - Panics during processing
  • service_graph_eviction_rate{org_id} - Evictions/sec (backpressure indicator)
  • service_graph_capacity_utilization{org_id} - Usage percentage
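As an illustration of the metric families listed above, labelled metrics might be defined with the prometheus crate roughly like this; the metric names mirror the lists, but the registration code itself is an assumption:

```rust
use once_cell::sync::Lazy;
use prometheus::{HistogramOpts, HistogramVec, IntCounterVec, IntGaugeVec, Opts};

static REQUEST_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    IntCounterVec::new(
        Opts::new(
            "traces_service_graph_request_total",
            "Total requests observed between client and server services",
        ),
        &["org_id", "client", "server", "connection_type"],
    )
    .expect("valid metric definition")
});

static REQUEST_SERVER_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    HistogramVec::new(
        HistogramOpts::new(
            "traces_service_graph_request_server_seconds",
            "Server-side latency of matched client/server span pairs",
        ),
        &["org_id", "client", "server", "connection_type"],
    )
    .expect("valid metric definition")
});

static STORE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    IntGaugeVec::new(
        Opts::new("service_graph_store_size", "In-memory pending edge count"),
        &["org_id"],
    )
    .expect("valid metric definition")
});

/// Would be called once at startup so the metrics appear in scrapes.
fn register_with_default_registry() {
    let registry = prometheus::default_registry();
    registry.register(Box::new(REQUEST_TOTAL.clone())).ok();
    registry.register(Box::new(REQUEST_SERVER_SECONDS.clone())).ok();
    registry.register(Box::new(STORE_SIZE.clone())).ok();
}

/// Emit the counters for one completed edge.
fn record_edge(org_id: &str, client: &str, server: &str, conn: &str, server_secs: f64) {
    REQUEST_TOTAL
        .with_label_values(&[org_id, client, server, conn])
        .inc();
    REQUEST_SERVER_SECONDS
        .with_label_values(&[org_id, client, server, conn])
        .observe(server_secs);
}
```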

Configuration

Environment variables (all prefixed with ZO_SERVICE_GRAPH_):

  • ZO_SERVICE_GRAPH_ENABLED (default: false) - Enable/disable service graph feature
  • ZO_SERVICE_GRAPH_WAIT_DURATION_MS (default: 10000) - Max time to wait for span pair matching
  • ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD (default: 100000) - LRU capacity per shard (256 shards = 25.6M total capacity)
  • ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS (default: 2000) - Cleanup worker frequency per shard

Note: Worker count is fixed at 256 (one per shard). Configuration is read from the OSS config module by the enterprise code.
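A small sketch of how these variables could be read with the standard library only; the defaults are copied from the list above, while the struct and function names are illustrative:

```rust
use std::env;

#[derive(Debug)]
struct ServiceGraphConfig {
    enabled: bool,
    wait_duration_ms: u64,
    max_items_per_shard: usize,
    cleanup_interval_ms: u64,
}

/// Parse an environment variable, falling back to a default on absence or parse failure.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn load_service_graph_config() -> ServiceGraphConfig {
    ServiceGraphConfig {
        enabled: env_or("ZO_SERVICE_GRAPH_ENABLED", false),
        wait_duration_ms: env_or("ZO_SERVICE_GRAPH_WAIT_DURATION_MS", 10_000),
        max_items_per_shard: env_or("ZO_SERVICE_GRAPH_MAX_ITEMS_PER_SHARD", 100_000),
        cleanup_interval_ms: env_or("ZO_SERVICE_GRAPH_CLEANUP_INTERVAL_MS", 2_000),
    }
}
```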

Performance

  • Throughput: O(1) writes with AHash sharding
  • Latency: Sub-millisecond span processing
  • Memory: ~1KB per incomplete edge, automatic LRU eviction
  • Cleanup: O(expired) per shard
  • Tests: 26 tests covering edge creation, matching, isolation, cleanup (all in enterprise codebase)

@github-actions

Failed to generate code suggestions for PR

@greptile-apps greptile-apps bot (Contributor) left a comment


Greptile Overview

Greptile Summary

This PR introduces a comprehensive service graph feature that automatically discovers service topology from distributed traces. The implementation uses a well-architected 256-shard concurrent edge store with LRU eviction, proper org-level isolation, and panic boundaries to protect trace ingestion.

Key architectural strengths:

  • 256-shard design with parking_lot RwLocks minimizes lock contention
  • Per-org isolation enforced at all layers (processing, storage, metrics, API)
  • Panic boundaries prevent service graph failures from affecting trace ingestion
  • Parallel cleanup workers (one per shard) with separate metrics worker
  • Comprehensive Prometheus metrics with org_id labels

Critical bug found:

  • Lines 144-148 in processor.rs: Both client_key and server_key are identical, preventing any client-server span matching. Server spans store themselves with parent_span_id as key (line 205), but client lookup uses wrong key.

Additional observations:

  • Configuration properly namespaced with the ZO_SERVICE_GRAPH_ prefix
  • Good test coverage across core modules
  • API endpoints properly filter by org_id for multi-tenant isolation
  • Bidirectional timing metrics (client-side and server-side latencies)

Confidence Score: 1/5

  • Critical bug prevents core functionality from working - span matching will fail for all client-server pairs
  • The service graph feature has a critical logic error in processor.rs:144-148 where client_key and server_key are identical, preventing any span pairing. This breaks the fundamental matching algorithm since server spans are stored with parent_span_id as key but lookups use span_id. Without this fix, no edges will ever complete successfully.
  • src/service/traces/service_graph/processor.rs requires immediate attention - the span matching logic is broken

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/service/traces/service_graph/processor.rs | 1/5 | Critical bug in client span matching logic - both client_key and server_key use same span_id, preventing proper span pairing |
| src/service/traces/service_graph/store.rs | 4/5 | Well-designed sharded LRU store with proper concurrency handling and org-level tracking |
| src/service/traces/service_graph/edge.rs | 5/5 | Clean data models with bidirectional timing support and comprehensive test coverage |
| src/service/traces/service_graph/worker.rs | 4/5 | Efficient parallel cleanup with per-shard workers and proper shutdown handling |
| src/service/traces/service_graph/api.rs | 5/5 | Properly enforces org-level isolation in metrics and stats endpoints |

Sequence Diagram

sequenceDiagram
    participant Client as Client Span
    participant Ingestion as Trace Ingestion
    participant Processor as Service Graph Processor
    participant Store as Sharded Edge Store
    participant Worker as Cleanup Workers
    participant Metrics as Prometheus Metrics
    participant API as REST API

    Client->>Ingestion: OTLP Trace Data
    Ingestion->>Ingestion: Parse and Extract Spans
    
    alt Service Graph Enabled
        Ingestion->>Processor: process_span with org context
        Note over Processor: Wrapped in catch_unwind
        
        alt Client or Producer Span
            Processor->>Store: Check for matching server edge
            alt Server Found
                Store-->>Processor: Return server edge
                Processor->>Processor: Complete edge with client data
                Processor->>Metrics: Emit edge metrics with org label
            else No Server
                Processor->>Store: Insert client edge with lookup key
                Note over Store: Uses trace and span identifiers
            end
        else Server or Consumer Span
            Processor->>Store: Check for matching client edge
            alt Client Found
                Store-->>Processor: Return client edge
                Processor->>Processor: Complete edge with server data
                Processor->>Metrics: Emit edge metrics with org label
            else No Client
                Processor->>Store: Insert server edge with parent key
                Note over Store: Uses trace and parent identifiers
            end
        end
    end

    loop Every 2 seconds per shard
        Worker->>Store: cleanup_shard for specific shard
        Store-->>Worker: Return expired edges
        Worker->>Metrics: Emit expired edge metrics per org
        Worker->>Metrics: Increment expiration counter per org
    end

    loop Every 5 seconds
        Worker->>Store: get_org_sizes from all shards
        Worker->>Metrics: Update store size gauge per org
        Worker->>Metrics: Update capacity utilization per org
        Worker->>Metrics: Calculate eviction rate per org
    end

    API->>Metrics: GET metrics endpoint with org filter
    Metrics-->>API: Filtered Prometheus metrics for org
    API-->>Client: Prometheus text format

    API->>Store: GET stats endpoint with org filter
    Store-->>API: Organization specific statistics
    API-->>Client: JSON statistics response

21 files reviewed, 1 comment


@ByteBaker
Contributor Author

Thanks for the review! I've addressed the comment about identical keys in processor.rs with commit ac1440e.

After analyzing the matching algorithm, I believe the identical keys are intentional:

Matching Logic:

  • Client span (span_id="C") stores itself at key trace-C
  • Server span (parent_span_id="C") stores itself at key trace-C (using parent_span_id)
  • When client looks for server: searches at trace-C
  • When server looks for client: searches at trace-C

The algorithm matches parent-child span relationships (client→server), not sibling relationships. Both operations use the client's span_id because:

  1. Server's parent_span_id equals Client's span_id
  2. This is the correct bidirectional matching for CLIENT/SERVER span pairs

I've consolidated the two variables into one lookup_key with clear documentation. All existing tests pass. If there's a specific scenario where matching fails, please let me know and I can add integration tests.
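To make that key scheme concrete, here is a hedged sketch; the function and enum are illustrative stand-ins, not the actual code in processor.rs:

```rust
/// Minimal stand-in for the span kinds involved in pairing.
enum SpanKind {
    Client,
    Producer,
    Server,
    Consumer,
}

// Both sides derive the same lookup key, so a CLIENT span and the SERVER span
// it spawned land on the same store entry:
//   CLIENT span (span_id = "C")        -> key "<trace_id>-C" (its own span_id)
//   SERVER span (parent_span_id = "C") -> key "<trace_id>-C" (its parent_span_id)
fn edge_lookup_key(
    trace_id: &str,
    span_kind: &SpanKind,
    span_id: &str,
    parent_span_id: &str,
) -> String {
    match span_kind {
        // Client/producer spans key by their own span_id.
        SpanKind::Client | SpanKind::Producer => format!("{trace_id}-{span_id}"),
        // Server/consumer spans key by the parent, i.e. the client's span_id.
        SpanKind::Server | SpanKind::Consumer => format!("{trace_id}-{parent_span_id}"),
    }
}
```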

@ByteBaker ByteBaker force-pushed the feat/service-graph branch 4 times, most recently from 16cc3ee to 2a56691 on October 28, 2025 at 07:58
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 2a56691

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout errors while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within timeout.

Root Cause Analysis

  • The timeout issue is likely related to changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to accommodate slower loading elements.
  2. Ensure that the element '[data-test="add-dashboard-name"]' is present and visible before the test runs.
  3. Review recent changes in the dashboard creation process that may affect element visibility.

View Detailed Results

@ByteBaker ByteBaker force-pushed the feat/service-graph branch 2 times, most recently from 92a44b1 to 456ca2b on October 28, 2025 at 10:11
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 456ca2b

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to be visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator timeout waiting for '[data-test="add-dashboard-name"]' to be visible.

Root Cause Analysis

  • The timeout issue in the test is likely related to recent changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js at line 28 to allow more time for the element to become visible.
  2. Verify that the element '[data-test="add-dashboard-name"]' is correctly rendered before the test attempts to interact with it.
  3. Add a wait condition or check for the element's presence before the visibility check to prevent timeouts.

View Detailed Results

@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 456ca2b

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Test fails due to timeout waiting for element visibility
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within 15 seconds.

Root Cause Analysis

  • The timeout issue in the test is likely related to changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to accommodate slower loading times.
  2. Ensure that the element '[data-test="add-dashboard-name"]' is present and visible before the test attempts to interact with it.
  3. Add explicit waits or checks for the element's visibility before proceeding with the test actions.

View Detailed Results

@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: 0391fb4

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 1s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator timeout waiting for '[data-test="add-dashboard-name"]' to be visible.

Root Cause Analysis

  • The timeout error in the test is likely related to recent changes in the dashboard creation logic in dashboard-create.js.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js for the element '[data-test="add-dashboard-name"]'.
  2. Ensure that the element is present and visible before the test attempts to interact with it.
  3. Review any recent UI changes that may affect the visibility of the dashboard name input.

View Detailed Results

Service graph automatically discovers service topology from distributed traces.

OSS includes:
- API routing endpoints (gated with enterprise feature flag)
- UI components (functional only with enterprise backend)
- Configuration options

Enterprise includes:
- 256-shard concurrent edge store with LRU eviction
- Span matching and processing logic
- Prometheus metrics and background workers
@testdino-playwright-reporter

Test Run Failed

Author: ByteBaker | Branch: feat/service-graph | Commit: e8ee6c5

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 test failed | 2 | 0 | 1 | 1 | 0 | 0% | 2m 2s |

Test Failure Analysis

  1. dashboard-streaming.spec.js: Timeout issues while waiting for elements to become visible
    1. dashboard streaming testcases should verify the custom value search from variable dropdown with streaming enabled: Locator '[data-test="add-dashboard-name"]' not visible within timeout.

Root Cause Analysis

  • The timeout error in dashboard-create.js indicates that the element was not rendered in time after recent code changes.

Recommended Actions

  1. Increase the timeout duration in dashboard-create.js to allow more time for the element to become visible.
  2. Verify that the element '[data-test="add-dashboard-name"]' is correctly rendered in the DOM before the test runs.
  3. Check for any recent changes in the UI that may affect the visibility of the element.

View Detailed Results

- Ensure memtable isolation to prevent cross-test data contamination
- Fix dashboard API version compatibility (v5 -> version-agnostic dashboard_id)
- Fix indentation bug causing function/dashboard/stream deletions to run inside alert loop
- Add 3-second wait after dashboard deletion for propagation
…dpoints

- Add Utoipa path documentation for get_service_graph_metrics endpoint
- Add Utoipa path documentation for get_store_stats endpoint
- Add rate limiting via x-o2-ratelimit extension (module: Traces)
- Register both endpoints in OpenAPI configuration for Swagger UI
- Include detailed response examples and parameter descriptions
@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/service-graph | Commit: 66ef2b9

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |

View Detailed Results

@ByteBaker ByteBaker merged commit 4715205 into main Oct 30, 2025
32 of 33 checks passed
@ByteBaker ByteBaker deleted the feat/service-graph branch October 30, 2025 11:50